The Data Analyst’s Handbook
Dr. Saöid Boularouk
Table of Contents
1 Preparation and Reflection 15
1.1 Understanding the Power of Data . . . . . . . . . 15
1.2 Data Analysis and Data Science . . . . . . . . . . 16
1.2.1 Data Analysis . . . . . . . . . . . . . . . . 16
1.2.2 Data Science . . . . . . . . . . . . . . . . 17
1.2.3 Business Analytics . . . . . . . . . . . . . 18
1.2.4 Comparison Between Business Analytics
and Data Science . . . . . . . . . . . . . . 19
1.3 Examples of Applications in Data Analysis and
Data Science . . . . . . . . . . . . . . . . . . . . . 20
1.3.1 Descriptive and Diagnostic Analysis . . . . 21
1.3.2 Predictive and Prescriptive Analytics . . . 21
1.3.3 Applications of Data Science . . . . . . . . 22
1.3.4 Summary of Applications . . . . . . . . . 22
1.4 The Role of the Data Analyst: Understanding
Distinctions . . . . . . . . . . . . . . . . . . . . . 22
1.4.1 Different Roles in Data Analysis . . . . . . 23
1.4.2 Key Responsibilities of a Data Analyst . . 24
1.5 Becoming a Project Manager as a Data Analyst . 25
1.5.1 Links Between Data Analyst and Project
Manager Roles . . . . . . . . . . . . . . . 27
1.5.2 Acting as a Detective . . . . . . . . . . . . 27
1.5.3 Instinct . . . . . . . . . . . . . . . . . . . 28
1.5.4 Analytical Skills and Thinking . . . . . . . 29
1.5.5 The Most Common Problems Encountered 32
1.6 Data Ecosystems . . . . . . . . . . . . . . . . . . 33
1.6.1 Data Collection . . . . . . . . . . . . . . . 34
1.6.2 Data Collection Sources . . . . . . . . . . 35
1.6.3 Best Practices for Data Collection . . . . . 36
1.6.4 The Classic Data Life Cycle . . . . . . . . 37
2 From Data to Decision 39
2.1 Data-driven Decision Making: A Strategic and
Methodical Approach . . . . . . . . . . . . . . . . 39
2.1.1 The Data Lifecycle: A Methodical Process 40
2.2 The Importance of Contextualization . . . . . . . 41
2.2.1 Some Recommendations for Effective
Contextualization . . . . . . . . . . . . . . 42
2.3 Data Analysis Lifecycle . . . . . . . . . . . . . . . 43
2.3.1 Phases of the Data Analysis Lifecycle . . . 43
2.3.2 Email Exchange Scenario for a Study on
the Profitability of Bicycles . . . . . . . . 46
2.3.3 Estimated Duration of the Project . . . . 50
2.3.4 Role Distribution . . . . . . . . . . . . . . 50
3 Skills and Tools 53
3.1 Data Limitations . . . . . . . . . . . . . . . . . . 53
3.1.1 Incomplete Data . . . . . . . . . . . . . . 54
3.1.2 Poorly Organized Data . . . . . . . . . . . 54
3.1.3 Corrupted Data . . . . . . . . . . . . . . . 54
3.2 Positioning the Data Analyst within the Organi-
zation . . . . . . . . . . . . . . . . . . . . . . . . 55
3.2.1 Adherence to Best Practices . . . . . . . . 59
3.2.2 Meetings . . . . . . . . . . . . . . . . . . . 63
3.3 Tools of the Data Analyst . . . . . . . . . . . . . 66
3.4 Spreadsheets . . . . . . . . . . . . . . . . . . . . . 67
3.4.1 Formulas . . . . . . . . . . . . . . . . . . . 67
3.4.2 Error Detection Tips . . . . . . . . . . . . 68
3.5 Structured Query Language (SQL) . . . . . . . . 68
3.6 R and Python for Data Analysts . . . . . . . . . 69
3.6.1 R . . . . . . . . . . . . . . . . . . . . . . . 69
3.6.2 Python . . . . . . . . . . . . . . . . . . . . 70
3.7 Data Analyst Tools . . . . . . . . . . . . . . . . . 71
3.8 Spreadsheets . . . . . . . . . . . . . . . . . . . . . 72
3.8.1 Formulas . . . . . . . . . . . . . . . . . . . 72
3.8.2 Tips for Detecting and Avoiding Errors . . 72
3.9 SQL Language . . . . . . . . . . . . . . . . . . . 73
3.10 R and Python for Data Analysts . . . . . . . . . 73
3.10.1 R . . . . . . . . . . . . . . . . . . . . . . . 73
3.10.2 Python . . . . . . . . . . . . . . . . . . . . 74
3.11 Example of Project Workflow . . . . . . . . . . . 75
3.12 Project Team Meeting . . . . . . . . . . . . . . . 75
3.12.1 Participants . . . . . . . . . . . . . . . . . 75
3.12.2 Meeting Flow . . . . . . . . . . . . . . . . 75
3.12.3 Roles and Responsibilities . . . . . . . . . 76
3.12.4 Meeting Conclusion . . . . . . . . . . . . . 77
4 Data Collection 79
4.1 Choosing Data . . . . . . . . . . . . . . . . . . . 79
4.1.1 Types of Data . . . . . . . . . . . . . . . . 80
4.1.2 Collection of Secondary and Third-Party
Data . . . . . . . . . . . . . . . . . . . . . 82
4.1.3 Ethical Considerations in Data Collection 82
4.2 Data Reliability . . . . . . . . . . . . . . . . . . . 82
4.3 Data Ethics and Confidentiality . . . . . . . . . . 84
4.3.1 Data Ethics . . . . . . . . . . . . . . . . . 84
4.3.2 Data Confidentiality . . . . . . . . . . . . 85
4.4 Open Source Data . . . . . . . . . . . . . . . . . 86
4.5 Metadata . . . . . . . . . . . . . . . . . . . . . . 86
4.5.1 Metadata Repository . . . . . . . . . . . . 87
4.6 Data Integrity . . . . . . . . . . . . . . . . . . . . 87
4.6.1 Laws and Standards . . . . . . . . . . . . 88
4.7 Data Cleaning . . . . . . . . . . . . . . . . . . . . 88
4.8 Insufficient Data . . . . . . . . . . . . . . . . . . 89
4.9 Corrupted Data . . . . . . . . . . . . . . . . . . . 89
4.10 Example of a Bike Rental Project Workflow . . . 90
4.10.1 Data Collection . . . . . . . . . . . . . . . 90
4.10.2 Choosing Data Storage Format . . . . . . 91
4.10.3 Data Extraction and Transformation . . . 91
4.10.4 Data Cleaning and Preprocessing . . . . . 92
4.10.5 Data Storage and Access . . . . . . . . . . 92
4.10.6 Transition to Data Analysis . . . . . . . . 92
4.10.7 Transforming JSON to CSV in Python . . 92
5 Data Cleaning 93
5.1 Using a Sample . . . . . . . . . . . . . . . . . . . 93
5.1.1 Why Use a Sample? . . . . . . . . . . . . 93
5.1.2 The Trade-offs of Sampling . . . . . . . . 94
5.1.3 How to Determine the Sample Size? . . . 94
5.1.4 Precautions and Strategies . . . . . . . . . 94
5.2 Using a Sample . . . . . . . . . . . . . . . . . . . 95
5.2.1 Sampling Bias . . . . . . . . . . . . . . . . 95
5.2.2 Factors to Consider in Determining
Sample Size . . . . . . . . . . . . . . . . . 95
5.3 Sample Size Calculation . . . . . . . . . . . . . . 96
5.3.1 Sample Size Calculation Formula . . . . . 96
5.3.2 Practical Example . . . . . . . . . . . . . 96
5.4 Data Cleaning with Spreadsheets . . . . . . . . . 97
5.5 Examples of Data Cleaning in Excel and Google
Sheets . . . . . . . . . . . . . . . . . . . . . . . . 100
5.5.1 Identification and Handling of Missing Data 100
5.5.2 Removing Duplicates . . . . . . . . . . . . 100
5.5.3 Removing Unwanted Characters . . . . . . 100
5.6 Data Cleaning: Key Steps and Tools . . . . . . . 100
5.6.1 Identification and Handling of Missing Data 101
5.6.2 Removing Duplicates . . . . . . . . . . . . 101
5.6.3 Correction of Inconsistencies and Typo-
graphical Errors . . . . . . . . . . . . . . . 102
5.6.4 Data Transformation . . . . . . . . . . . . 102
5.6.5 Removing or Imputing Missing Data . . . 103
5.7 Bike Rental Project Example . . . . . . . . . . . 103
5.7.1 Data Cleaning with R . . . . . . . . . . . 104
5.7.2 Removing Missing Values in R . . . . . . . 105
6 Aggregation and Interpretation of Data 107
6.1 Importance and Techniques of Data Aggregation . 107
6.2 Correlation and Causality . . . . . . . . . . . . . 108
6.3 Key Aggregation Methods . . . . . . . . . . . . . 109
6.4 Tools and Software for Data Aggregation . . . . . 110
6.4.1 Main Tools and Software Used . . . . . . . 111
6.4.2 Examples of Data Aggregation by Sector . 111
6.4.3 Aggregation in SQL . . . . . . . . . . . . . 113
6.4.4 Aggregation in Python . . . . . . . . . . . 114
6.4.5 Aggregation in R . . . . . . . . . . . . . . 115
6.5 Data Interpretation . . . . . . . . . . . . . . . . . 116
6.6 Mathematical Formulas in Data Analysis . . . . . 118
6.6.1 Practical Example . . . . . . . . . . . . . 121
6.6.2 Mathematics and Tools . . . . . . . . . . . 121
6.7 Example of the Bike Rental Project . . . . . . . . 122
6.7.1 Context . . . . . . . . . . . . . . . . . . . 122
6.7.2 Data Analysis by Kareem . . . . . . . . . 123
6.8 Analysis of Bike Rental Trends . . . . . . . . . . 126
6.8.1 1. Weekend Rentals . . . . . . . . . . . . . 126
6.8.2 2. Holiday Periods . . . . . . . . . . . . . 126
6.8.3 3. Peak Hours . . . . . . . . . . . . . . . . 127
6.8.4 4. Optimization of Bike Management . . . 128
6.8.5 Conclusion . . . . . . . . . . . . . . . . . . 128
7 Visualization and Communication 129
7.1 Data Visualization: An Essential Tool for Com-
munication . . . . . . . . . . . . . . . . . . . . . . 129
7.2 The Impact of Visualization on Communication
and Decision Making . . . . . . . . . . . . . . . . 131
7.3 Design Thinking . . . . . . . . . . . . . . . . . . 132
7.4 Good and Bad Graphic Representation . . . . . . 133
7.4.1 A good graphic representation is charac-
terized by: . . . . . . . . . . . . . . . . . 133
7.4.2 On the other hand, a bad graphic repre-
sentation has the following characteristics: 134
7.5 Visualization Methods . . . . . . . . . . . . . . . 134
7.5.1 The McCandless Method . . . . . . . . . . 135
7.5.2 Verification of the Junk Charts Trifecta by
Kaiser Fung . . . . . . . . . . . . . . . . . 136
7.6 Exploitation of Pre-Attentive Attributes in Data
Visualization . . . . . . . . . . . . . . . . . . . . 137
7.6.1 Visual Marks . . . . . . . . . . . . . . . . 138
7.6.2 Visual Channels . . . . . . . . . . . . . . . 138
7.7 Geomatics and Spatial Data Representation . . . 139
7.7.1 The Importance of Spatial Visualization . 139
7.7.2 Practical Applications of Spatial Visuali-
zation . . . . . . . . . . . . . . . . . . . . 140
7.8 Design Principles . . . . . . . . . . . . . . . . . . 140
7.9 Titles and Labels . . . . . . . . . . . . . . . . . . 142
7.10 Relevance of Visualization . . . . . . . . . . . . . 143
7.11 Creating an Effective Dashboard . . . . . . . . . . 145
7.12 Choosing the Right Chart . . . . . . . . . . . . . 146
7.12.1 Histogram and Density Plot . . . . . . . . 146
7.12.2 Bar Chart . . . . . . . . . . . . . . . . . . 146
7.12.3 Line Chart and Pie Chart . . . . . . . . . 146
7.12.4 Best Practices . . . . . . . . . . . . . . . . 147
7.13 The Art of the Slide Show . . . . . . . . . . . . . 147
7.14 Presenting Results . . . . . . . . . . . . . . . . . 148
7.14.1 Tips for a Successful Presentation . . . . . 148
7.14.2 Anticipating Objections . . . . . . . . . . 148
7.14.3 Anticipating Questions . . . . . . . . . . . 149
7.14.4 Citations and Failure Cases . . . . . . . . 150
7.15 Example of the Bike Rental Project . . . . . . . . 150
7.16 Analysis Code . . . . . . . . . . . . . . . . . . . . 150
7.17 Distribution of Bike Types . . . . . . . . . . . . . 151
7.18 Distribution of Trip Durations . . . . . . . . . . . 151
7.19 Number of Rides by Member Type . . . . . . . . 152
7.20 Number of Rides by Time of Day . . . . . . . . . 152
7.21 Top 10 Start Stations . . . . . . . . . . . . . . . . 153
7.22 Heatmap of Rides . . . . . . . . . . . . . . . . . . 153
8 Data Analysis and AI 155
8.1 Why Do We Analyze Data? What We Are Loo-
king For and How to See Beyond the Numbers . . 155
8.1.1 The Goal of Data Analysis . . . . . . . . . 155
8.1.2 Can We See Beyond the Data? . . . . . . 156
8.1.3 Can Data Analysis Prevent a Problem? . 156
8.2 Artificial Intelligence in Data Analysis . . . . . . 157
8.2.1 Data Collection . . . . . . . . . . . . . . . 157
8.2.2 Data Preprocessing . . . . . . . . . . . . . 157
8.2.3 Data Analysis . . . . . . . . . . . . . . . . 158
8.2.4 Data Visualization . . . . . . . . . . . . . 158
8.2.5 Prediction and Modeling . . . . . . . . . . 159
8.3 Artificial Intelligence: Product of Data or Gene-
rator of Knowledge? . . . . . . . . . . . . . . . . 159
8.3.1 The Fuel of Artificial Intelligence: Data . 159
8.3.2 AI as an Extension of Data Analysis . . . 160
8.3.3 The Philosophy of AI and Data: The
Question of Autonomy and Creation . . . 160
8.3.4 An Intelligence Based on Accumulation . . 161
8.4 Applications of AI in Data Analysis . . . . . . . . 161
8.5 Application of Artificial Intelligence to Analysis:
The Importance of Predictive Analysis . . . . . . 162
8.5.1 What is Predictive Analytics? . . . . . . . 162
8.5.2 The Role of AI in Predictive Analytics . . 163
8.5.3 Examples of AI-powered Predictive Ana-
lytics Applications . . . . . . . . . . . . . 163
8.5.4 Limitations and Challenges . . . . . . . . 163
8.6 Prediction Models . . . . . . . . . . . . . . . . . . 164
8.6.1 Linear Regression . . . . . . . . . . . . . . 164
8.6.2 Logistic Regression . . . . . . . . . . . . . 165
8.6.3 Random Forests . . . . . . . . . . . . . . . 165
8.6.4 Artificial Neural Networks . . . . . . . . . 165
8.6.5 Support Vector Machines (SVM) . . . . . 166
8.6.6 Time Series Models: ARIMA and LSTM . 166
8.6.7 Applications and Limitations of Predic-
tive Models . . . . . . . . . . . . . . . . . 166
8.7 Prediction of Pastry Sales . . . . . . . . . . . . . 166
8.7.1 Dataset . . . . . . . . . . . . . . . . . . . 167
8.7.2 Python Code to Generate the Data . . . . 167
8.8 Dataset . . . . . . . . . . . . . . . . . . . . . . . 167
8.9 Python Code . . . . . . . . . . . . . . . . . . . . 168
8.10 Prediction using Linear Regression . . . . . . . . 169
8.10.1 Linear Regression Function . . . . . . . . 169
8.10.2 Example : Predicting Pastry Sales . . . . . 170
8.10.3 Python Code for Linear Regression . . . . 170
8.10.4 Results . . . . . . . . . . . . . . . . . . . . 171
8.10.5 Interpretation of Coefficients . . . . . . . . 172
8.10.6 Conclusion . . . . . . . . . . . . . . . . . . 173
8.11 Conclusion: The Profile of a Data Analyst and
Tips for Success . . . . . . . . . . . . . . . . . . . 173
Abstract
Since Antiquity, data has played a key role in understanding
and managing societies. The earliest evidence of its use dates
back to Egyptian and Babylonian civilizations, where it was em-
ployed to optimize agriculture, trade, and construction projects.
Over the centuries, its applications evolved, spanning from sta-
tistical mapping during the Renaissance to Florence Nightinga-
le’s groundbreaking contributions to public health in the 19th
century.
The 20th century marked a major transformation with the
rise of computing. Databases began to structure information on
a large scale, and statistical analysis became an indispensable
discipline. Today, in the era of the digital revolution, data analy-
sis is ubiquitous, offering sophisticated tools capable of revealing
patterns, predicting the future, and solving complex challenges.
Data is not just numbers; it is a magical key that unlocks
unexpected perspectives. By deciphering problems and their ori-
gins, data unveils solutions that are often revolutionary. Data
analysis has become essential for guiding businesses in strate-
gic decision-making, maximizing revenue, reducing costs, and
forecasting trends. Data analysts are architects of the future,
illuminating the paths forward with their informed predictions.
It all begins with a question: “What do we want to know?”
This book is designed as a roadmap. It guides you through the
initial steps of collecting and processing data to its interpreta-
tion and visualization. We will explore powerful tools, proven
techniques, and essential practices to transform raw data into
actionable insights.
The role of the data analyst has evolved to become a criti-
cal link in achieving success, whether in businesses, government
agencies, or nonprofit organizations. Mastering data means pos-
sessing the power to make informed decisions, optimize perfor-
mance, and build a sustainable future.
Data is not merely numbers; it is a compass for navigating
a constantly changing world. This book is an invitation to delve
into this fascinating universe and fully harness the potential of
data to shape a better future.
All rights reserved. No part of this publication may be reproduced,
stored, or transmitted in any form or by any means—electronic, me-
chanical, photocopying, recording, scanning, or otherwise—without
the prior written permission of the publisher. It is illegal to copy this
book, display it on a website, or distribute it by any other means
without authorization.
The scenarios in this book are entirely fictional. The names, characters, and
incidents depicted are the product of the author’s imagination. Any
resemblance to actual persons, living or dead, events, or locations is
purely coincidental.
First Edition, 2023
Chapter 1
Preparation and Reflection: The First Steps of Data Analysis
1.1 Understanding the Power of Data
Data lies at the heart of modern decision-making. Whether in
business, science, healthcare, or politics, it plays an essential role
in providing insights that inform strategic decisions. Therefore,
it is crucial to understand the different categories of data, as
they directly influence the analytical approach to adopt.
Data can be grouped into two main categories: quantita-
tive data and qualitative data. Quantitative data, such as num-
bers and statistics, is used to answer questions like "what,"
"how much," and "how often," enabling the measurement of
trends and making comparisons. In contrast, qualitative data
is non-numerical and focuses on context, perceptions, and hu-
man behaviors, helping to explore answers to "why" and "how"
questions.
One of the keys to in-depth analysis lies in combining these
two types of data. While quantitative data provides precise num-
bers, qualitative data adds context to better understand the reasons
behind those numbers, offering a more comprehensive and
strategic view of the situation.
Data can also be classified by size and complexity. Small
data, characterized by smaller and more specific volumes, is
often used for ad hoc and contextual decisions. This type of
data can be easily managed with simple tools like spreadsheets.
In contrast, big data represents vast datasets, often heteroge-
neous and collected over extended periods, requiring sophisti-
cated tools for analysis, such as Hadoop or Spark. Big data is
crucial for solving large-scale problems, like analyzing consumer
behavior or optimizing processes in large organizations.
The choice between small data and big data depends on the
context, the size of the data, and the analysis objectives. Small
data is easier to handle and suitable for simple analyses, while
big data is necessary for complex, large-scale challenges, enabling
the extraction of critical insights for strategic decision-making.
1.2 Data Analysis and Data Science
Data analysis and data science are two closely related yet
distinct disciplines that play
a central role in today’s digital age. These multidisciplinary
fields involve the use of statistical, computational, and analy-
tical techniques to transform large amounts of data into actio-
nable insights, enabling informed decision-making and solving
complex problems.
1.2.1 Data Analysis
Data analysis is a systematic process of collecting, orga-
nizing, and examining data to draw conclusions and support
strategic decision-making. This discipline relies on an analyti-
cal cycle that includes several steps : data collection, cleaning,
transformation, exploration, statistical modeling, and the inter-
pretation of results. For instance, in a market study, a data ana-
1.2. Data Analysis and Data Science 17
lyst might start by examining consumer purchasing behaviors
over several years and then use statistical methods to identify
seasonal trends and correlations between products bought toge-
ther.
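As a minimal sketch of this analytical cycle, the pandas pipeline below walks through collection, cleaning, transformation, and a simple interpretation step. The purchase records are invented purely for illustration, not taken from any real market study.

```python
import pandas as pd

# Collection: hypothetical monthly purchase records (illustration only)
raw = pd.DataFrame({
    "month": ["2023-01", "2023-02", "2023-02", "2023-03", None],
    "product": ["coffee", "coffee", "tea", "coffee", "tea"],
    "units": [120, 135, 80, 150, 95],
})

# Cleaning: drop records missing a month
clean = raw.dropna(subset=["month"])

# Transformation: parse dates, then aggregate units sold per month
clean = clean.assign(month=pd.to_datetime(clean["month"]))
monthly = clean.groupby("month")["units"].sum()

# Exploration/interpretation: a simple summary of the monthly trend
print(monthly)
print("Average monthly units:", monthly.mean())
```

A real analysis would add statistical modeling after this step, but the skeleton of collect, clean, transform, interpret is the same.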
Data analysts act like detectives, scrutinizing complex da-
tasets to uncover hidden patterns and make predictions. One
of the keys to successful data analysis is the ability to ask the
right questions. While exploring data, such as a company’s per-
formance metrics, an analyst might not only ask what happened
but also why it happened and how it might evolve in the future.
As Clive Humby, a renowned data scientist, famously sta-
ted, "Data is the new oil" [Humby, 2006]. This highlights the
growing importance of data in decision-making processes. Addi-
tionally, Tom Davenport, a professor at Babson College, empha-
sizes that "Data analytics is becoming a strategic imperative for
businesses" [Davenport, 2006]. These quotes illustrate how data
analysis has become indispensable for extracting actionable in-
sights and guiding decisions.
In summary, data analysis transforms raw information into
actionable insights through rigorous statistical methods. It leve-
rages tools such as descriptive analytics to understand historical
trends and predictive analytics to anticipate future outcomes.
1.2.2 Data Science
Data science goes beyond traditional analysis by aiming to
transform massive, often unstructured datasets into practical so-
lutions for complex problems. This multidisciplinary field com-
bines computer science, statistics, machine learning, and predic-
tive modeling to extract powerful insights from large quantities
of data, often sourced from diverse and disparate origins. For
example, data scientists might use machine learning algorithms
to predict consumer behavior online based on their past inter-
actions with a website.
The primary goal of data science is to ask questions that have
yet to be considered and to develop models that answer these
inquiries. Unlike data analysis, which focuses on understanding
existing data, data science seeks to create new hypotheses and
design models that forecast the future. For instance, predictive
models could be used to anticipate customer purchasing trends
or detect fraudulent transactions before they occur.
As Geoffrey Hinton, a pioneer in artificial intelligence, remar-
ked, "Artificial intelligence is the new electricity" [Hinton, 2016],
underscoring the foundational role of advanced technologies like
data science in solving complex problems. Drew Conway further
explains, "Data science is a blend of statistical skills, computing
expertise, and business acumen," emphasizing the interdiscipli-
nary nature of this field.
A fundamental aspect of data science is exploring data from
multiple perspectives and seeking innovative ways to analyze
and comprehend complex phenomena. By combining advanced
modeling techniques with massive data processing capabilities,
data scientists tackle challenges that traditional analytical me-
thods struggle to address.
1.2.3 Business Analytics
Business analytics focuses on applying analytical techniques
to help organizations make more informed strategic decisions.
It relies on mathematics and statistics to interpret a compa-
ny’s data and predict future outcomes. The goal is to optimize
decision-making within the organization, whether by improving
existing processes or identifying new opportunities.
There are several types of analytics used in business analytics, each serving a distinct purpose:

Descriptive Analytics: This type of analysis aims to understand historical data and identify trends. For example, a company might analyze past sales to identify seasonal peaks and adjust inventory accordingly.

Predictive Analytics: It uses past data to forecast future events. For instance, predictive analytics might estimate future product sales based on previous trends and consumer behaviors.

Diagnostic Analytics: This type of analysis is used to identify the root cause of a problem. For example, if a company experiences a decline in sales, diagnostic analytics could reveal whether this is due to poor inventory management or ineffective advertising.

Prescriptive Analytics: It provides recommendations on actions to take to achieve a goal. For example, it could suggest strategies to maximize return on investment (ROI) or reduce costs by optimizing internal processes.
These various forms of analytics enable businesses to bet-
ter understand their past, predict the future, and make deci-
sions based on objective data. Bernard Marr, a business ana-
lytics expert, notes, "Companies that master business analytics
make better decisions, faster, and achieve superior performance"
[Marr, 2015].
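As an illustration of the contrast between descriptive and predictive analytics, the short Python sketch below first summarizes a toy quarterly sales series, then extrapolates it with an ordinary least-squares line. The figures are hypothetical, and real predictive work would use richer models and proper validation.

```python
from statistics import mean

# Toy quarterly sales figures (hypothetical)
sales = [100, 110, 125, 138]              # quarters 1..4
quarters = list(range(1, len(sales) + 1))

# Descriptive analytics: summarize what happened
print("Average sales:", mean(sales))      # 118.25
print("Total growth:", sales[-1] - sales[0])

# Predictive analytics: fit y = a + b*x by least squares, forecast quarter 5
xm, ym = mean(quarters), mean(sales)
b = sum((x - xm) * (y - ym) for x, y in zip(quarters, sales)) / \
    sum((x - xm) ** 2 for x in quarters)
a = ym - b * xm
forecast_q5 = a + b * 5
print("Forecast for quarter 5:", round(forecast_q5, 1))  # → 150.5
```

Diagnostic and prescriptive analytics build on the same numbers: the former would ask why growth accelerated, the latter what action the forecast should trigger.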
1.2.4 Comparison Between Business Analytics and
Data Science
Although Business Analytics and Data Science share certain
methods, their goals and applications differ. Business Analytics
primarily focuses on optimizing organizational decision-making
by using structured data to solve concrete and specific problems.
In contrast, Data Science is more exploratory and creative, ai-
ming to generate new insights from massive, often unstructured,
datasets to address complex problems.
As Davenport and Harris explain in their book Competing
on Analytics, "Companies that embrace data
analytics outperform those that do not prioritize it"
[Davenport and Harris, 2007a]. Business Analytics emphasizes
the practical application of results to improve organizational
performance, whereas Data Science focuses on creating predictive
models and identifying new questions and approaches. For
example, a Business Analytics professional might concentrate
on optimizing a supply chain, while a data scientist might aim
to predict consumer trends by analyzing data from various
online and mobile sources.
Both disciplines require skills in statistics and computing,
but Data Science generally demands more advanced expertise
in mathematical modeling and machine learning. It addresses
open-ended, less defined problems, while Business Analytics ta-
ckles more specific questions centered on improving immediate
business outcomes.
Figure 1.1 – Comparison Between Data Science and Data Analytics.
1.3 Examples of Applications in Data Analy-
sis and Data Science
This section explores concrete examples of applications for
different types of data analysis and data science. These cases
illustrate how these techniques are used to solve complex pro-
blems and optimize processes across various sectors.
1.3.1 Descriptive and Diagnostic Analysis
Descriptive analysis summarizes the main characteristics
of a dataset, highlighting past trends. For instance, during
the COVID-19 pandemic, hospitals used descriptive analytics
to anticipate medical equipment needs. This approach allo-
wed them to avoid critical shortages by adjusting stock le-
vels based on observed trends in admissions and treatments
[Dupont and Martin, 2020].
Diagnostic analysis, on the other hand, identifies the under-
lying causes of events. A typical example involves manufacturing
companies that used this approach to detect reasons for produc-
tivity declines. By analyzing data on machine breakdowns and
work schedules, they were able to address issues in inventory ma-
nagement and maintenance, reducing downtime and improving
operational efficiency [Durand and Lefevre, 2019].
1.3.2 Predictive and Prescriptive Analytics
Predictive analytics uses historical data to anticipate future
events. For instance, in the banking sector, predictive models
are employed to identify fraudulent transactions in real-time,
preventing significant financial losses [Lambert, 2018]. Another
example from the telecommunications industry demonstrates
how predictive analytics estimates customer churn rates, en-
abling companies to implement preventive measures like special
offers or enhanced customer service [Benoit and Petit, 2020].
Prescriptive analytics, on the other hand, recommends speci-
fic actions to optimize a process. A striking example is Amazon’s
use of this approach to suggest more efficient delivery routes,
considering factors such as real-time traffic and customer pre-
ferences. This optimization reduces costs and enhances custo-
mer satisfaction [Smith and Johnson, 2019]. In the healthcare
sector, prescriptive systems can recommend personalized treat-
ments based on medical history and recent research, improving
patient outcomes [Thomas and Boucher, 2021].
1.3.3 Applications of Data Science
Data science combines advanced techniques to address com-
plex and poorly defined problems. A notable example is the
application of machine learning in the financial sector to detect
fraud. Sophisticated models analyze thousands of transactions
per second, identifying suspicious behaviors before they lead to
losses [Harris and Wang, 2020].
In another domain, Netflix uses machine learning algorithms
to recommend movies and series to its users by analyzing their
past viewing behaviors and overall preferences. This system en-
hances the user experience, increasing engagement and subscri-
ber loyalty [Gauthier and Martin, 2020].
Another example of data science’s power is the applica-
tion of predictive models to manage flood risks in Hous-
ton. By analyzing real-time weather and hydrological data,
authorities were able to issue early warnings, saving lives
and minimizing property damage during tropical storms
[Miller and Thompson, 2019].
1.3.4 Summary of Applications
The examples presented illustrate how each type of analy-
tics (descriptive, diagnostic, predictive, prescriptive) and data
science plays a crucial role in optimizing processes. Whether un-
derstanding past trends, predicting the future, diagnosing pro-
blems, or recommending actions, these tools enable organiza-
tions to make informed decisions, improving profitability and
efficiency [Dupuis and Roux, 2022].
1.4 The Role of the Data Analyst: Understanding Distinctions
Today, the title of "Data Analyst" often appears alongside
other job titles in the field of data analysis. While these roles
may share similarities, each involves specific responsibilities and
skill sets. Understanding these distinctions is essential to grasp
the unique role of the Data Analyst within an organization’s
data ecosystem.
1.4.1 Different Roles in Data Analysis
Data analysis professionals can hold various titles, each with
distinct tasks and objectives. These roles, while complementary,
differ in scope and approach:
Business Analyst: Focuses on improving business processes by analyzing sales, customer, and operational data to help the company optimize performance and better meet market needs.

Business Intelligence Analyst: Specializes in analyzing financial and business data to help organizations make strategic decisions based on detailed insights into markets, competition, and finances.

Data Analytics Consultant: Analyzes an organization’s data management systems and recommends improvements to enhance decision-making processes through better data utilization.

Data Engineer: Responsible for preparing and integrating data from various sources, ensuring quality and consistency before analysis, thus facilitating the work of data analysts.

Data Scientist: Uses advanced analysis and machine learning methods to identify complex patterns and trends, solving problems and providing innovative, data-driven solutions.

Data Specialist: Focuses on organizing, cleaning, and preparing data to make it accessible and usable for analysis and decision-making teams.

Operations Analyst: Concentrates on analyzing workflows and internal operations to identify improvement levers for optimizing process efficiency and profitability.
While these roles may overlap, each has its specialty. The
Data Analyst, in particular, stands out for their ability to trans-
form data into actionable insights that guide organizational de-
cisions.
1.4.2 Key Responsibilities of a Data Analyst
The Data Analyst stands out with an approach focused on
data collection, cleaning, analysis, and communication. Their
primary responsibilities include:
1. Data Collection : The data analyst starts by gathering
the necessary information from various internal and ex-
ternal sources, such as databases, CSV files, surveys, or
online systems. They ensure the relevance, quality, and
integrity of the data for the upcoming analysis.
2. Data Cleaning and Preparation : Before any analy-
sis, the data analyst cleans the data by removing outliers,
handling missing values, and ensuring consistency to gua-
rantee reliable and actionable results.
3. Data Analysis : The analyst uses statistical methods
and machine learning techniques to identify trends, re-
lationships, and anomalies in the data. Their goal is to
answer specific questions, such as "Why have our sales
dropped ?" or "What are the key factors influencing cus-
tomer behavior ?"
4. Data Visualization : An essential part of the ana-
lyst’s role is making the results accessible to all stake-
holders. Using charts, dashboards, and dynamic reports,
they present the data in a clear and understandable way,
facilitating decision-making.
5. Communication of Results : Beyond analysis, the
data analyst must effectively communicate their results
to stakeholders. This involves creating detailed reports
and presenting the findings in a concise and accessible
manner for non-experts.
6. Forecasting and Modeling : In certain situations, the
data analyst creates predictive models to anticipate fu-
ture trends. For example, they might develop models to
predict customer purchasing behaviors or estimate the
risk of a particular event.
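Taken together, these responsibilities form a pipeline. The sketch below illustrates steps 1 to 3 and step 5 on an invented dataset; the records, field names, and the outlier threshold are hypothetical, and only Python's standard library is used:

```python
from statistics import mean

# Hypothetical raw records gathered from a survey (step 1: collection).
raw = [
    {"customer": "A", "spend": 120.0},
    {"customer": "B", "spend": None},     # missing value
    {"customer": "C", "spend": 95.0},
    {"customer": "D", "spend": 9_999.0},  # obvious outlier
    {"customer": "E", "spend": 110.0},
]

# Step 2: cleaning -- drop missing values and implausible outliers
# (the 1 000 cutoff is an illustrative business rule, not a standard).
cleaned = [r for r in raw if r["spend"] is not None and r["spend"] < 1_000]

# Step 3: analysis -- a simple summary statistic.
average_spend = mean(r["spend"] for r in cleaned)

# Step 5: communication -- a plain-text "report" for stakeholders.
report = f"{len(cleaned)} valid records, average spend {average_spend:.2f}"
print(report)
```

In practice each step is far richer, but the shape of the work, collect, clean, summarize, report, stays the same.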
1.5 Becoming a Project Manager as a Data
Analyst
A data analyst can indeed evolve into a project manager
role, a transition based on developing project management skills,
gaining relevant experience, and pursuing appropriate training.
Here are the key factors to consider for successfully making this
transition :
— Project Management Skills : To transition from data
analysis to project management, a data analyst must ac-
quire essential project management skills, such as plan-
ning, organization, and resource management. The ability
to coordinate multidisciplinary teams and oversee bud-
gets and timelines is crucial. A project manager must
also master risk management and quality management
methods.
— Practical Experience : Experience in project mana-
gement, even on a small scale, is fundamental. A data
analyst can start by overseeing internal data analysis pro-
jects and gradually take responsibility for larger projects.
This process allows them to become familiar with mana-
gement dynamics while continuing to leverage their ana-
lytical skills.
— Additional Training and Certifications : To so-
lidify their project management skills, a data ana-
lyst can undertake specific training, such as the PMP (Project Management Professional) or PRINCE2 certifications. These are widely recognized
and enhance the professional’s credibility in a project ma-
nagement role.
— Domain Knowledge : An effective project manager
must understand their field of activity, and this is equally
true for a data analyst looking to transition into project
management. Mastery of specific challenges in a given
sector, such as data projects or digital transformation,
can help design effective strategies and anticipate obs-
tacles.
— Agile Methodologies and Modern Tools : Adop-
ting agile methodologies, such as Scrum or Kanban,
can greatly facilitate the management of data projects.
A data analyst with a good understanding of agile prin-
ciples will be better able to manage projects flexibly, ad-
justing priorities and deliverables based on the evolving
needs of the client or stakeholders. Project management
tools like Trello, Jira, or Asana can also be useful for
organizing and tracking team progress.
— Leadership and Communication : A project mana-
ger must demonstrate leadership skills, be able to make
quick decisions, and inspire confidence within their team.
Additionally, advanced communication skills are crucial
for interacting with various stakeholders, from manage-
ment to operational teams. A data analyst should work
on their ability to lead, motivate, and persuade to effec-
tively manage a project.
— Career Goals : Transitioning to a project manager role
depends on the data analyst’s career objectives. If they
are motivated by a strategic role where team manage-
ment and project planning are essential, this transition
becomes a viable option. However, this evolution requires
the ability to move beyond purely technical analysis to
embrace broader aspects of project development and im-
plementation.
1.5.1 Links Between Data Analyst and Project Manager Roles
Project management in the field of data analysis is not just
about tracking timelines and budgets. The data analyst holds
a major advantage : their ability to make decisions based on
precise data. This skill becomes crucial when evaluating project
performance and adjusting plans in real time. For instance, a
data analyst in a project manager role might use the results of
their analyses to adjust project priorities or identify inefficiencies
in processes.
A data analyst with project management skills can also play
a key role in the continuous optimization of project processes
using agile methodologies. By leveraging tools like KPIs (Key
Performance Indicators) and making data-driven adjustments,
an analyst becomes a project manager capable of making infor-
med decisions and leading strategically.
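This KPI-driven style of project steering can be made concrete with a small sketch; the KPI names, values, and tolerance below are invented for illustration:

```python
def at_risk(kpis, tolerance=0.05):
    """Flag KPIs whose actual value falls more than `tolerance`
    below target -- a trigger to re-prioritize project work."""
    return [name for name, actual, target in kpis
            if target - actual > tolerance]

# Hypothetical KPIs for a data project: (name, actual, target).
kpis = [
    ("on_time_delivery", 0.82, 0.90),         # behind target -> flagged
    ("stakeholder_satisfaction", 0.88, 0.90), # within tolerance
    ("data_quality_score", 0.97, 0.95),       # ahead of target
]
print(at_risk(kpis))
```

The value of such a check is not the arithmetic but the habit: priorities get adjusted because a measured gap says so, not because of a hunch.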
1.5.2 Acting as a Detective
As a data analyst, you play a key role in helping businesses
adopt data-driven decision-making processes, which is why it’s
important to understand how data plays a role in decision-
making.
Data + business knowledge = mystery solved.
To make a decision, a data analyst must always ask pertinent
questions in order to strike the right balance between data and business knowledge :
— How can I define the success of this project ?
— What type of results are needed ?
— Who will be informed ?
— Am I answering the question that is being asked ?
— What is the urgency of the decision that needs to be
made ?
Key questions help analysts know if they’ve done enough to
move forward and ensure that teams have spent enough time on
each phase, avoiding starting the modeling process before the
data is ready.
When a decision is urgent, you may need to rely on your own knowledge and experience more than usual : there simply isn’t enough time to thoroughly analyze all the available data. But if you are working on a project
that involves a lot of time and resources, the best strategy is to
focus more on the data. As a data analyst, it’s your responsibility
to make the best possible choice. Throughout your career in data
analysis, you’ll likely combine data and knowledge in a million
different ways. And the more you practice, the better you’ll be
at finding that perfect balance.
The role of the data analyst in solving business mysteries is
similar to that of a detective in an investigation. The famous Sherlock Holmes line "Elementary, my dear Watson !" makes detection sound effortless, but in reality, solving data mysteries is not as simple as that. In fact, it requires
high-level analytical skills to ask the right questions and cross-
reference the right information. As W. Edwards Deming, the
pioneer of quality management, pointed out, "In God we trust ;
all others bring data." This quote reminds us of the importance
of basing decisions on tangible facts, not just on intuition or
conjecture.
In your role as a detective, you will leverage your natural
abilities in strategy, technical expertise, and data design. These
are extremely useful skills to have and need to be strengthened
further.
1.5.3 Instinct
Detectives and data analysts have a lot in common. Both
rely on facts and clues to make decisions. Both gather and exa-
mine evidence. Both talk to people who know part of the story.
And both may follow leads to see where they take them. Whether
you’re a detective or a data analyst, your job is to follow steps
to gather and understand the facts. A data analyst might joke,
"A detective knows that a clue is a data point, but an analyst
knows that two clues don’t necessarily make a truth !" This bit
of humor highlights the importance of not jumping to conclu-
sions too quickly. And while instinct can be a good compass, it
should never replace data : "Intuition is a small voice, but data
speaks louder."
Instinct is the "feeling" that something is right : an intuitive understanding that comes with little or no explanation, often without our even realizing it. It can be helpful for making decisions in data analysis, but because it operates largely below conscious awareness, it is essential for data analysts to anchor themselves in the data to ensure they are making informed decisions.
Ignoring data and relying on your own experience could bias
your decisions. But even worse, decisions based on instinct without any data to support them can lead to mistakes. As Peter Drucker, the management expert, said, "The best way to predict the future is to create it." Shaping the future deliberately means grounding our decisions in reliable data rather than in uninformed instinct.
1.5.4 Analytical Skills and Thinking
Reasoning or analytical thinking involves identifying and de-
fining a problem, then solving it by using data in an organi-
zed, step-by-step manner. This doesn’t prevent you from having
creative and critical thinking about your analytical methods to achieve better results. Analytical thinking is a skill that every
data analyst must sharpen, much like a detective hones their ob-
servational talents. A quote often attributed to Charles Darwin emphasizes the importance of adapting to context : "It is not the strongest of the species that survive, nor the most intelligent, but the one most responsive to change." This flexibility is essential for
interpreting data in context and adjusting strategies based on
new elements. Here are the five key aspects of analytical reaso-
ning : visualization, strategy, problem orientation, correlation,
and finally, a big-picture and detail-oriented mindset.
Visualization : In data analytics, visualization is the gra-
phical representation of information. Representations can
be maps, graphs, or charts. Visualization is important
because visuals can help data analysts understand and
explain information more effectively than words. Facts
and numbers become easier to see, and complex concepts
become easier to understand. A saying often attributed to Albert Einstein applies here : "If you can’t explain it simply, you don’t understand it well enough." This quote reminds us of the importance of making data accessible and understandable to everyone.
Data Strategy : When developing strategies, we know that
all our data is valuable and can help us achieve our goals.
Strategy also improves the quality and usefulness of the
data we collect. It helps data analysts visualize what they
want to achieve with the data and how they can do so. It
is crucial to define a strategy at the outset while staying
focused on the right path.
Correlation : Being able to identify correlations
between different data elements. A correlation is like a re-
lationship. You can find all kinds of correlations in data.
However, it is important to remember that correlation
does not imply causality. In other words, just because
two data points move in the same direction does not ne-
cessarily mean they are related. Mark Twain popularized a humorous line that sums up this idea : "There are three kinds of lies : lies, damned lies, and statistics." This serves as a warning against oversimplifying data.
Problem-Oriented Approach : To identify, describe, and
solve problems, analysts ask a lot of questions to understand the context. This helps improve communication and
save time while working on a solution. They may uncover
several other problems and solutions to avoid them in the
future. A problem-oriented approach helps keep the main
issue in focus throughout the project. Another saying often attributed to Einstein applies : "If I had an hour to solve a problem, I would spend 55 minutes framing the problem and 5 minutes solving it." This shows the importance of thoroughly understanding the question before diving into finding answers.
Big-Picture Thinking : This is an overview of the pro-
blem. If you only focus on individual pieces, you won’t be
able to see beyond them. Big-picture thinking helps you
step back and see possibilities and opportunities. This
can lead to new ideas or exciting innovations.
Detail-Oriented Thinking : This is a more detailed view.
Detail-oriented thinking involves understanding all the
necessary aspects of executing a plan.
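The point about correlation not implying causality can be illustrated numerically. The sketch below, using invented monthly figures, computes a Pearson coefficient from first principles : ice-cream sales and drowning incidents correlate strongly, yet the shared driver is the season, not one causing the other.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented monthly figures: both series rise in summer.
ice_cream_sales = [20, 25, 40, 60, 80, 75, 50, 30]
drownings       = [ 2,  3,  5,  8, 10,  9,  6,  4]

r = pearson(ice_cream_sales, drownings)
# r is close to 1 here, yet it reflects a shared cause (the season),
# not ice cream causing drownings -- correlation is not causation.
```

A high coefficient is an invitation to investigate, never a conclusion in itself.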
Analytical skills are qualities and characteristics associated
with solving problems using facts. The five essential areas to de-
velop are : curiosity, contextual understanding, technical mind-
set, data design, and data strategy.
Curiosity : The desire to learn new things. Being curious
often leads to seeking new challenges and experiences,
which leads to acquiring knowledge. Funny Quote : "I’m
so curious I could get lost in Wikipedia information for
hours. But don’t worry, I’ll always come back to the main
question." - A data analyst in search of answers
Contextualization : Understanding context means gras-
ping the environment in which something exists or occurs.
It also involves grouping information into meaningful ca-
tegories.
Technical Mindset : The ability to break down complex
tasks into smaller elements and work with them in a lo-
gical and organized way.
Data Design : How you organize information. As a data
analyst, design usually refers to an actual database, but
also to the logical organization of data so that it is easily
accessible and interpretable.
Data Strategy : The management of people, processes,
and tools used in data analysis. This includes ensuring
that the right data is used, that the path to the solution
is clear, and that the right technologies are employed to
solve the problem.
1.5.5 The Most Common Problems Encountered
Data analysts face various complex challenges in their daily
work. Among the most commonly encountered problems, here
are six major categories :
1. Forecasting : This type of problem involves using his-
torical and current data to anticipate future trends. For
example, a company seeking to optimize its advertising
campaigns to attract new customers may ask data ana-
lysts to predict the potential impact of different adverti-
sing strategies.
2. Categorization : Categorization involves organizing
data into distinct groups based on common characteris-
tics. For example, a company aiming to improve custo-
mer experience might ask analysts to classify customer
service calls based on criteria such as type of request or
satisfaction level, helping identify trends and areas for
improvement.
3. Discovering Connections : Data analysts often seek to
identify relationships between different variables or enti-
ties. This allows them to combine information from di-
verse sources to solve complex problems and offer new
insights.
4. Pattern Recognition : Analyzing historical data is
commonly used to identify recurring patterns that help
better understand past events and predict future beha-
viors. For example, detecting patterns in purchasing ha-
bits can help a company personalize its offers.
5. Anomaly Detection : This type of problem in-
volves spotting deviations from normal data trends. For
example, a wearable tech company might use data analy-
sis to identify anomalies in health metrics (e.g., abnormal
heart rate), which could indicate a health issue or device
malfunction.
6. Theme Identification : Beyond simple categorization,
theme identification involves grouping data into broader,
more meaningful concepts. This process is particularly
useful for analyzing qualitative data, such as user studies
where themes might include needs, attitudes, or specific
behaviors.
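The anomaly-detection problem above can be sketched with a simple z-score rule; the heart-rate readings and the two-standard-deviation threshold are invented for illustration:

```python
from statistics import mean, stdev

def anomalies(readings, threshold=2.0):
    """Return readings lying more than `threshold` standard deviations
    from the mean -- a crude but common anomaly-detection rule."""
    mu, sigma = mean(readings), stdev(readings)
    return [x for x in readings if abs(x - mu) > threshold * sigma]

# Hypothetical resting heart-rate readings (beats per minute);
# the 150 could indicate a health issue or a device malfunction.
heart_rate = [62, 64, 61, 63, 65, 62, 150, 63, 64, 62]
print(anomalies(heart_rate))
```

Real systems use more robust detectors, but the idea, define "normal" from the data and flag deviations from it, is the same.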
These various challenges provide data analysts with oppor-
tunities to apply their analytical skills to solve diverse problems
and provide valuable insights, enabling informed, strategic deci-
sions within the organization.
1.6 Data Ecosystems
Data ecosystems consist of a complex combination of data
infrastructures, people, communities, and organizations that ex-
ploit and benefit from the value generated by the data. These
elements interact to produce, manage, store, organize, analyze,
and share data. A robust data ecosystem is not just about using
technological tools ; it also relies on a company culture that va-
lues data as the main driver of strategic decision-making and
innovation. This Data Mindset plays a crucial role in how data
is used and integrated into organizational processes.
Data doesn’t just reside in servers or management systems ;
it is increasingly stored and processed in the cloud, a virtual
environment that allows access to data via the Internet. The
cloud infrastructure thus plays a central role in the flexibility
and accessibility of data, making the ecosystem more agile and
accessible. However, for a data ecosystem to reach its full po-
tential, adopting a Data Mindset becomes crucial.
A Data Mindset is about approaching challenges and oppor-
tunities by relying on evidence-based data rather than intuition
or emotions. In a constantly evolving data environment, this
mindset leads to informed decision-making, powered by precise
analysis and relevant questions. Businesses and individuals with
such a mindset constantly seek to understand why things hap-
pen rather than just accepting that they do. This analytical,
data-driven approach becomes the cornerstone of modern data
ecosystems.
The infrastructure of data ecosystems, including elements
like Big Data, the Internet of Things (IoT), and advanced ana-
lytics systems, increasingly requires data analysts who don’t just
collect and organize data, but also know how to analyze it cri-
tically to derive meaningful insights. Big Data, in particular,
which comes from diverse sources like social media, IoT sensors,
and online transactions, represents both a major challenge and a
great opportunity for analysts who know how to apply suitable
analytical models. In such a context, adopting a Data Mindset
becomes not just a key skill but also a catalyst for innovation
and growth.
1.6.1 Data Collection
Data collection is a fundamental process that allows for ans-
wering specific questions and providing valuable information for
decision-making. However, for the collected data to be valuable,
it is essential that they align with a clear data collection stra-
tegy and adhere to the principles of Data Mindset : collecting the
right data, at the right time, and transparently. Effective data
collection depends not only on the tools but also on the ability
to ask the right questions, understand the objectives, and know
how to interpret the results to guide decision-making.
1.6.2 Data Collection Sources
Data sources are diverse and varied, each providing specific
information that feeds into the data ecosystem. Among the main
data collection sources are :
Archives : Internal archives are a crucial resource for any
organization. They contain historical data that, when
properly managed and analyzed, can provide valuable in-
sights to inform future decisions.
Surveys : Surveys allow data to be collected directly from
individuals, offering both qualitative and quantitative
perspectives on specific topics. The design of appropriate
questionnaires and careful management of the collection
process are essential to ensure the quality of the results.
Interviews : Interviews, whether individual or group-based,
are a data collection method that provides in-depth qua-
litative data and helps understand complex phenomena
in context.
Observations : Field observation provides direct unders-
tanding of behaviors or events in their natural environ-
ment. This approach, especially useful in qualitative stu-
dies, allows data to be collected without interfering with
the observed processes.
Experiments : Experiments involve manipulating variables
to test hypotheses. They generate quantitative data that,
when analyzed, can confirm or reject theories.
Social Media : Analyzing data from social media, such
as user posts and interactions, is becoming increasingly
critical to understand trends and public opinion. Well-
interpreted data from these sources can influence marke-
ting strategies and business decisions.
Web and URL : Online data collection methods such as
web scraping allow information to be extracted from web
pages, providing a rich source of both qualitative and
quantitative data.
Sensors : Sensors, used to collect real-time data, play an
increasingly important role in areas like health, the en-
vironment, and the Internet of Things (IoT). This data
can be analyzed to detect anomalies, predict trends, or
provide contextual insights.
1.6.3 Best Practices for Data Collection
Adopting a Data Mindset also involves following rigorous
practices in data collection to ensure quality and reliability. Here
are some essential best practices :
— Setting Clear Objectives : Before any collection, it
is important to define precise objectives to guide data
search.
— Strategic Planning : Planning helps identify the best
sources and collection methods while considering avai-
lable resources and organizational constraints.
— Data Protection : Respect data protection laws and
regulations, including anonymizing or securing sensitive
information.
— Data Validation : Ensure that the collected data is re-
liable and consistent by validating it throughout the pro-
cess.
— Ethical Data Collection : Data collection must be
conducted ethically, ensuring confidentiality and obtai-
ning consent from participants when necessary.
— Secure Storage : Data must be securely stored to
prevent any risk of leakage or loss.
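Several of these practices, notably data validation, lend themselves to automation. A minimal sketch, where the field names and plausibility ranges are illustrative rather than prescriptive:

```python
def validate(record):
    """Return the list of rule violations for one collected record."""
    errors = []
    if not record.get("id"):
        errors.append("missing id")
    age = record.get("age")
    if age is None or not (0 <= age <= 120):
        errors.append("age out of range")
    return errors

# Hypothetical survey records; two of them violate a rule.
records = [
    {"id": "r1", "age": 34},
    {"id": "",   "age": 34},   # fails: missing id
    {"id": "r3", "age": 212},  # fails: implausible age
]
valid = [r for r in records if not validate(r)]
print(len(valid))
```

Running such checks throughout collection, rather than once at the end, is what makes validation a practice rather than an afterthought.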
1.6.4 The Classic Data Life Cycle
Data goes through several stages during its life cycle, each
playing a role in its preparation, analysis, and use. This life cycle
can be broken down into several key stages :
Planning : Determine the types of data to collect, the
sources to use, and the management methods.
Creation : The initial generation of data.
Storage : Organization and management of data in secure
systems.
Processing : Preparation and cleaning of data for analysis.
Analysis : Applying analytical tools to extract insights.
Visualization : Creating graphical representations to com-
municate results.
Dissemination : Sharing results and insights with stake-
holders.
Archiving : Storing data for future use.
Destruction : Securely deleting data when no longer nee-
ded.
In summary, an effective Data Mindset well integrated into
the data ecosystem is a key success factor in an organizatio-
nal environment. It transforms data into a strategic asset and
enables an organization to reach new heights of efficiency, inno-
vation, and informed decision-making.
Chapter 2
From Data to Decision
2.1 Data-driven Decision Making : A Strategic and Methodical Approach
Data-driven decision making is a methodical approach that
relies on the analysis and interpretation of objective data to in-
form strategic choices. This process involves gathering relevant
data, conducting in-depth analysis, and drawing conclusions ai-
med at guiding decisions. Used across various fields such as busi-
ness, healthcare, and public administration, this approach aims
to minimize biases and improve the accuracy of decisions.
However, data alone is not enough to ensure informed deci-
sions. Their true value lies in their strategic interpretation, which
requires not only rigorous analysis but also the expertise of do-
main professionals. These experts can identify anomalies, ad-
just hypotheses, and interpret the results in a broader context
beyond the simple correlations observed in the data. Thus, data-
driven decision making is a collaborative effort, involving both
data analysts and stakeholders, including experts who have a
deep understanding of the specific issues of the project.
By combining technical skills and human expertise, this pro-
cess uncovers patterns and trends that raw data alone would
not reveal. However, to ensure the relevance of decisions, the
data must be analyzed considering contextual and ethical fac-
tors. Data-driven decision making should therefore not be isola-
ted but integrated into a broader reflection that includes expert
opinions and an evaluation of the social and ethical consequences
of choices.
2.1.1 The Data Lifecycle : A Methodical Process
The data-driven decision-making process follows a structured
lifecycle, composed of six major steps that guide the analysis in
a coherent and systematic manner :
Pose specific questions and clearly define the problem to be
solved ;
Prepare the data by collecting and organizing them opti-
mally ;
Process the data by cleaning and verifying their quality ;
Analyze the data to detect patterns, underlying relation-
ships, and significant trends ;
Share the analysis results in a clear and understandable way
for stakeholders ;
Act by making informed decisions based on the analysis re-
sults.
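These six steps can be read as a pipeline in which each phase consumes the output of the previous one. The sketch below is schematic : the phase functions and the sample data are placeholders, not a prescribed implementation.

```python
def ask(problem):              # 1. frame the question
    return {"question": problem}

def prepare(ctx):              # 2. collect and organize (hypothetical data)
    ctx["data"] = [12, None, 15, 11, None, 14]
    return ctx

def process(ctx):              # 3. clean and verify quality
    ctx["data"] = [x for x in ctx["data"] if x is not None]
    return ctx

def analyze(ctx):              # 4. detect patterns / summarize
    ctx["insight"] = sum(ctx["data"]) / len(ctx["data"])
    return ctx

def share(ctx):                # 5. communicate to stakeholders
    ctx["report"] = f"{ctx['question']}: average = {ctx['insight']:.1f}"
    return ctx

def act(ctx):                  # 6. decide based on the result
    return ctx["report"]

result = ask("Why did sales drop?")
for phase in (prepare, process, analyze, share):
    result = phase(result)
print(act(result))
```

The ordering matters : each function assumes the guarantees established by the one before it, which is exactly why skipping a phase (analyzing unprocessed data, for example) undermines the decision at the end.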
This lifecycle is part of a structured thinking approach, which
relies on the clarity of the problem, the logical organization of
information, the search for gaps in the data, and the evalua-
tion of possible options to solve the problem. This methodology
structures reasoning and maximizes the value of data at every
stage of the process.
Summary of Key Points :
— Reduction of Biases : Data-driven decision making re-
duces subjective biases by relying on measurable facts.
— Collaboration and Expertise : The decision-making
process is collaborative, involving data analysts and ex-
perts with a deep knowledge of the domain.
— Identification of Trends : Data analysis uncovers hid-
den patterns, improving decision making.
— Importance of Ethics : The approach must include
ethical and contextual reflections to avoid biases in inter-
preting results.
2.2 The Importance of Contextualization
In the field of data analysis, one of the key skills is the ability
to contextualize the data. This not only involves staying objec-
tive but also recognizing all dimensions of the problem before
drawing conclusions. Context is crucial but also highly perso-
nal. Indeed, two analysts working on the same dataset, with
the same instructions and methodologies, may reach different
results. Why ? Because there are no universal contextual inter-
pretations. Each person approaches the data through their own
frame of reference, influenced by personal, cultural, and social
factors.
Even though the data collection process is rigorous, the in-
terpretation of the data can be biased. Conclusions can thus be
altered by personal biases, whether conscious or subconscious,
stemming from cultural, social norms or even market impera-
tives. If the analysis is not conducted objectively, it may lead to
erroneous or misleading conclusions.
To truly understand what the data is saying, the analyst
must ask fundamental questions about the who, what, where,
when, how, and why of the collected data. For example :
— Who collected the data and for what purpose ?
— What do they represent in their original context ?
— Where were they collected, and in what environment ?
— When were they collected ? Are the data up to date ?
— How were they processed and analyzed ?
— Why were these data chosen over others ?
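These contextual answers can travel with the data itself as a provenance record; the fields and example values below are illustrative:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Provenance:
    """Answers to the contextual questions, stored with the dataset."""
    who: str    # who collected the data, and for what purpose
    what: str   # what the values represent in their original context
    where: str  # where / in what environment they were collected
    when: date  # when -- needed to judge whether data is current
    how: str    # how they were processed and analyzed
    why: str    # why these data were chosen over others

meta = Provenance(
    who="field team, customer-satisfaction study",
    what="responses on a 1-5 scale",
    where="in-store interviews, downtown branch",
    when=date(2024, 6, 1),
    how="manual entry, then de-duplicated",
    why="only source covering walk-in customers",
)

# A simple freshness check driven by the recorded context.
is_recent = (date(2025, 1, 1) - meta.when).days < 365
```

Recording context in a structured form makes it harder for later analyses to ignore it silently.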
These questions allow the analyst to clarify the context and
identify potential biases that may hinder the interpretation of
the data. Indeed, data analysis does not take place in a vacuum
but in a specific framework that must always be considered to
avoid erroneous conclusions.
Cognitive biases, such as confirmation bias, selection bias,
or anchoring bias, play a crucial role in how we perceive and
interpret information. For example, an analyst might uncons-
ciously prioritize data that confirms a pre-existing hypothesis,
neglecting data that contradicts it. These biases can influence
how results are presented, making contextualization even more
critical.
2.2.1 Some Recommendations for Effective Contextualization
— Diversity of Sources : Always ensure that the data
comes from multiple and diverse sources to minimize se-
lection bias.
— Data Verification : Ensure rigorous verification of the
data before interpreting them, particularly by validating
their accuracy and relevance in the current context.
— Critical Approach : Adopt a critical and reflective ap-
proach throughout the analysis process, questioning as-
sumptions and testing multiple interpretation models.
— Collaboration with Experts : Work closely with do-
main experts to validate the context and methodological
choices.
In conclusion, contextualization of data is not a mere forma-
lity. It is an essential step to ensure the objectivity and accu-
racy of the analysis. By understanding the context in which the
data was collected and staying vigilant about cognitive biases,
the data analyst can produce more reliable and relevant results,
thus contributing to informed decisions.
2.3 Data Analysis Lifecycle
The data lifecycle represents a dynamic and continuous pro-
cess, essential to ensuring the quality of decisions made from
collected data. Each step plays a crucial role in transforming
raw data into usable, reliable, and valuable information for stra-
tegic decision-making processes.
2.3.1 Phases of the Data Analysis Lifecycle
While there is no single or universal structure adopted by all
data analysts, common principles are shared across various data
processing methodologies. These fundamental steps ensure that
the data is properly managed and interpreted at each phase of
the cycle.
Here are examples of lifecycle models proposed by different
institutions and organizations :
— Google Data Analytics Certificate Lifecycle : This
model represents a classic yet highly structured process
for data analysis. It includes critical steps that promote
rigor and collaboration at each phase of the project. The
cycle consists of the following steps :
Ask : Define challenges, objectives, and business ques-
tions to solve.
Prepare : Generate, collect, store, and manage the ne-
cessary data.
Process : Clean the data and ensure their integrity.
Analyze : Explore the data, perform visualizations, and
extract insights.
Share : Communicate and interpret results clearly for
stakeholders.
Act : Implement the results to resolve the initial pro-
blem.
This lifecycle is designed to be iterative, allowing the ana-
lyst to return to the ask phase to refine or reformulate
hypotheses based on the results of analyses.
— EMC Corporation (Dell EMC) Lifecycle : This
cycle focuses on a more modeling and planning-oriented
approach before proceeding with the execution of deci-
sions. It consists of the following steps :
Discover : Identify the problem to solve and collect the
data.
Data Preprocessing : Clean and validate the data.
Model Planning : Design the appropriate analytical
framework.
Model Building : Develop predictive or descriptive
models.
Communicate Results : Share conclusions and recom-
mendations.
Operationalize : Implement the models in an operatio-
nal context.
— SAS Iterative Lifecycle : SAS’s approach is particu-
larly relevant in situations where it is necessary to ree-
valuate hypotheses and adjust actions based on results
obtained at each phase. The steps in this cycle include :
Ask : Formulate the problem and key questions.
Prepare : Collect, clean, and structure the data.
Explore : Perform a preliminary exploration of the data
to uncover initial patterns.
Model : Develop models to test hypotheses.
Implement : Apply the models in real-world conditions.
Act : Make decisions and implement.
Evaluate : Measure impact and adjust models or stra-
tegies.
— Big Data Lifecycle (according to Big Data Fundamentals: Concepts, Drivers & Techniques): This model stands out by detailing steps specific to massive data. It separates certain stages of data preparation and cleaning, thus offering additional granularity in managing large-scale data. This cycle consists of:
Business Case Evaluation: Define business goals from available data.
Data Identification: Identify relevant data sources.
Data Acquisition and Filtering: Collect and filter data based on needs.
Data Extraction: Extract relevant information from the sources.
Data Validation and Cleaning: Ensure data consistency, reliability, and integrity.
Data Analysis and Presentation: Analyze and interpret the data.
Decision Implementation: Implement decisions based on the analyzed data.
The crucial aspect to remember from these different models is
their adaptability and flexibility. Analysts can choose the model
that best fits the specific context of the project and the nature
of the data they are working with. Regardless of the chosen
approach, each lifecycle phase aims to structure the analysis
while ensuring that decisions based on data are as informed and
effective as possible.
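As an illustration, the iterative structure shared by these models can be sketched in a few lines of Python. This is only a schematic sketch: the phase names follow the Google cycle described above, and the `needs_refinement` test is a placeholder for real stopping criteria (for example, stakeholder sign-off on the findings).

```python
# A minimal sketch of an iterative analysis lifecycle. The phase names come
# from the Google cycle above; the refinement test is a stand-in for real
# criteria deciding whether to loop back to the Ask phase.
PHASES = ["ask", "prepare", "process", "analyze", "share", "act"]

def run_lifecycle(needs_refinement, max_iterations=3):
    """Run the phases in order, looping back to 'ask' while refinement is needed."""
    history = []
    for _ in range(max_iterations):
        history.extend(PHASES)
        if not needs_refinement(history):
            break
    return history

# Example: one refinement pass, then stop.
trace = run_lifecycle(lambda history: len(history) < 2 * len(PHASES))
```

The point of the loop is the one made in the text: the cycle is not a straight line, and the analyst decides at the end of each pass whether the questions need to be reformulated.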
2.3.2 Email Exchange Scenario for a Study on the Profitability of Bicycles
Scenario: The Impact of Decreased Profits
The bike rental company is encountering a series of issues. The company's profitability has significantly decreased, and management has noticed an increase in management fees related to redistributing the bikes in different areas of the city. Additionally, there has been a rise in the number of damaged bikes, further exacerbating the costs.
Email Exchange Between Mike and Jean
Mike, the sales manager, decides to initiate a study to better understand the root causes of this drop in profitability. He contacts Jean, the company manager, to propose an in-depth analysis of the company's data. Here is the email exchange:
Email from Mike to Jean:
Hello Jean,
I hope all is well. I have taken note of the current situation of our bike rental business. The decrease in profitability, the rise in management fees, and the issues related to bike maintenance lead me to propose an in-depth study of the available data.
We store information across several platforms, including data on online payments, subscriptions, bike inventory, and the locations of charging stations. I think it would be relevant to collect and analyze this data to identify potential causes of the drop in profitability and define actions to improve our business model.
The study could focus on key aspects such as stock management, the costs related to bike redistribution, and bike usage across the city. What do you think?
Best regards,
Mike
Jean's Response to Mike:
Hello Mike,
Thank you for your message. I completely agree with the idea of an in-depth study of the data. However, I would like to draw your attention to one point: I believe it would be risky to limit this study to just a few key aspects like stock management or the costs of redistributing bikes.
The available data could give us a broader and more complete view. Other factors, such as rental prices, competition, or even external elements like advertising and visibility, may also influence our results. It is important not to close any doors before fully understanding what the data is telling us.
I suggest we do not limit the study from the start, but instead adopt an open approach. This way, we can better identify the various factors influencing our profitability.
What do you think?
Best regards,
Jean
This dialogue emphasizes the importance of gathering complete information before making decisions and of not limiting an analysis to overly narrow assumptions. As an analyst, it is crucial to examine all available data before drawing conclusions about the causes of the company's issues.
After considering Jean's concerns about not limiting the study to predefined aspects, Mike, the sales manager, decides to take action. He prepares a plan for a comprehensive analysis of the company's data and seeks the assistance of an external consulting firm.
Subject: Action Plan for Profitability and Logistics Study
Hello Jean,
Thank you for your feedback regarding the study on various aspects of our business. I fully understand your position and completely agree with not restricting the analysis to the elements we already know. The data must speak for itself, and for that, a comprehensive approach is essential.
I will contact the data analytics consulting firm to conduct a full study. The idea would be to collect and analyze the available data from several platforms: sales, subscriptions, maintenance costs, customer feedback, and of course, the GPS data of the bikes and stations.
I will keep you updated on the first results.
Best regards,
Mike
Mike Contacts the Consulting Firm: Managed by Kareem
Mike reaches out to the consulting firm and explains the project's objectives to Kareem, an experienced data analyst. Kareem is assigned to lead the analysis of these complex data sets. The goal is to understand the underlying causes of the drop in profitability and propose solutions based on objective data.
Subject: Data Analysis Project for Our Bike Rental Business
Hello Kareem,
I would like to entrust you with a data analysis project regarding our bike rental business in the city. The aim is to identify the factors negatively affecting the company's profitability, particularly the bike redistribution fees, frequent repairs, and the impact of subscription management.
Here are the data we would like to explore:
— Rental data (number of rentals by geographic area, seasonality, etc.).
— Subscription and customer behavior analysis.
— The impact of logistics costs (bike redistribution).
— Profitability of the business by geographic area.
You have access to all the necessary data, and we
entrust you with analyzing it in depth. Feel free to
ask us any questions if needed. It is crucial that you
provide us with a clear picture of what is happening
before proposing solutions.
Please let me know your schedule for taking on this
project. We need a first report within three weeks.
Best regards,
Mike
Subject: RE: Data Analysis Project for Our Bike Rental Business
Hello Mike,
Thank you for the project description. I will start by organizing the data collected from various platforms and ensure we have access to all the necessary information. Access to data will be my initial priority, particularly subscription and online payment transaction data.
To give you an idea, I plan to follow these steps:
1. Collect and verify data from the various platforms.
2. Prepare the datasets and clean the information.
3. Analyze trends in rentals, maintenance, and logistics costs.
4. Develop preliminary results and propose recommendations.
Depending on the access to data and the availability
of information, I believe I can provide a detailed first
report in about two weeks.
I will get back to you if I need more details or additional information.
Best regards,
Kareem
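Kareem's four steps can be sketched as a minimal collect-clean-analyze pipeline. This is only an illustrative sketch: the dataset, the column names (`area`, `rentals`, `logistics_cost`), and the sample rows are all hypothetical, not the company's real data.

```python
import pandas as pd

def collect(raw_rows):
    """Step 1: gather records from the various platforms (hypothetical schema)."""
    return pd.DataFrame(raw_rows)

def clean(df):
    """Step 2: drop incomplete rows and normalize the area labels."""
    df = df.dropna(subset=["area", "rentals"])
    df["area"] = df["area"].str.strip().str.title()
    return df

def analyze(df):
    """Step 3: total rentals and logistics costs per geographic area."""
    return df.groupby("area")[["rentals", "logistics_cost"]].sum()

# Hypothetical sample: inconsistent labels and one incomplete record
rows = [
    {"area": " north ", "rentals": 120, "logistics_cost": 300.0},
    {"area": "North", "rentals": 80, "logistics_cost": 150.0},
    {"area": None, "rentals": 40, "logistics_cost": 60.0},
]
report = analyze(clean(collect(rows)))
```

Step 4, developing recommendations, remains a human task: the aggregated `report` is only the input to that discussion.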
2.3.3 Estimated Duration of the Project
The estimated duration of the project depends on several
key factors. Given the size of the databases (sales, customer
behavior, logistics costs, maintenance), the variety of sources,
and the complexity of the required analysis, the project should
last around 3 to 4 weeks for a first detailed analysis.
This timeframe includes: collecting data from all relevant platforms, cleaning and preparing the data to ensure consistency and quality, identifying trends and patterns that will explain the profitability issues, and producing recommendations based on the analysis results.
The goal is for Kareem to provide a first detailed report within two weeks, allowing Mike's team to have an initial overview of the situation and make informed decisions on the actions to take. This approach ensures that the objectives are defined by the data itself, providing a more precise and objective perspective on the causes of the issues while maximizing the chances of finding effective solutions.
2.3.4 Role Distribution
From the outset, and before forming the project team, it is important to clearly establish that the goal is to understand the data rather than confirm preconceived ideas. This means that team members must be open to surprises that data analysis might reveal and that no actions should be proposed until trends have been identified. Before moving forward with recommendations, Kareem could organize an exploratory analysis phase. This would allow several hypotheses to be tested without being influenced by the team's initial ideas. This phase could last a few days or weeks, depending on the available data and the tools used.
In this scenario, Mike, as the sales manager, would be the one to nominate the project team. As the head of commercial operations and having given the go-ahead for the study, he is responsible for defining the overall objectives and overseeing the project. However, since data analysis is a highly specialized task, Kareem, as an expert data analyst, will take on a technical leadership role in the project.
After concluding that the data is heterogeneous and spread across different platforms and formats, Kareem determines that it would be more sensible to include a data engineer on the team to ensure efficient data normalization, cleaning, and integration. The data engineer would be responsible for the technical management of the data infrastructure, ensuring that the data is compatible, clean, and ready for analysis. Mike proposes Lee as the data engineer.
While Mike might propose the initial team composition, identifying the necessary roles, it is Kareem who, thanks to his expertise in data analysis, will likely take the lead on the technical aspects of the project. Kareem will be the operational lead for data analysis and manipulation, but Mike will remain responsible for the project's strategic direction, coordination, and relationships with other departments and stakeholders.
Chapter 3
Key Skills and Tools for a Data Analyst
The tools that data analysts use are essential for their work.
In this chapter, we will explore in detail the essential tools every
data analyst should be familiar with, starting by taking a closer
look at the data.
3.1 Data Limitations
Data is a valuable asset for informing decision-making, but it has intrinsic limitations that a savvy analyst must understand and manage. Recognizing these constraints is essential for drawing reliable conclusions and helping individuals and organizations make wise choices. Although powerful, data can become misleading when incomplete, poorly organized, or corrupted, leading to erroneous interpretations and potentially counterproductive decisions. To avoid these pitfalls, it is imperative to ensure the completeness, consistency, and clarity of data before proceeding with analysis.
3.1.1 Incomplete Data
When part of the necessary information is missing, it is challenging to guarantee solid conclusions. This does not mean that such data should be ignored; however, it is essential to be transparent about the limitations of the analysis and, if possible, to seek complementary sources to enrich the dataset. Honesty about the incomplete nature of the data maintains the credibility of the analysis until additional data is available.
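In practice, being transparent about incompleteness starts with measuring it. A minimal sketch using pandas, on a hypothetical rentals extract (the column names and values are invented for illustration):

```python
import pandas as pd

# Hypothetical rentals extract with gaps (columns are illustrative)
rentals = pd.DataFrame({
    "station": ["A", "B", None, "D"],
    "rides": [120, None, 95, 80],
})

# Quantify missingness per column before drawing any conclusions,
# and state these shares explicitly when reporting results.
missing_share = rentals.isna().mean()

# Rows usable for analyses that require every field
complete_rows = rentals.dropna()
```

Reporting `missing_share` alongside the results is a simple way to keep the analysis honest about its limitations rather than silently dropping rows.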
3.1.2 Poorly Organized Data
Poorly structured data, such as spreadsheets with inconsistent formats or missing headers, poses a risk to the integrity of analyses. Understanding and harmonizing collection conventions (types of separators, vertical or horizontal formats, units, etc.) is essential to avoid interpretation errors. Thorough standardization ensures reliable and relevant comparisons between different sources and teams, thereby enhancing the accuracy of the results.
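Harmonizing conventions before comparison can be sketched as follows; the two exports, their headers, and their date formats are hypothetical examples of the inconsistencies described above:

```python
import pandas as pd

# Two hypothetical exports of the same kind of figures, with different
# conventions: day/month/year dates versus ISO dates, and different headers.
eu = pd.DataFrame({"date": ["31/01/2024"], "revenue_eur": [1200.5]})
us = pd.DataFrame({"Date": ["2024-01-31"], "Revenue": [1300.0]})

# Harmonize headers and date formats before any comparison
eu["date"] = pd.to_datetime(eu["date"], format="%d/%m/%Y")
us = us.rename(columns={"Date": "date", "Revenue": "revenue_eur"})
us["date"] = pd.to_datetime(us["date"], format="%Y-%m-%d")

combined = pd.concat([eu, us], ignore_index=True)
```

Declaring the expected format explicitly (the `format=` argument) is safer than letting dates be guessed, since `31/01` and `01/31` are exactly the kind of ambiguity that silently corrupts comparisons.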
3.1.3 Corrupted Data
Corrupted data, marked by errors, inconsistencies, or duplications, represents a major obstacle to analysis. If not corrected, it can skew conclusions, reduce productivity, and increase costs. An effective cleaning process involves identifying, correcting, or eliminating these flawed data points. Documenting the adjustments made prevents misunderstandings and enhances the transparency of the analysis, ensuring robust and reliable results.
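The cleaning-plus-documentation process just described can be sketched in a few lines; the payments table, the duplicate, and the `-999` error code are hypothetical examples:

```python
import pandas as pd

# Hypothetical payments extract containing a duplicate and an error code
payments = pd.DataFrame({
    "txn_id": [1, 2, 2, 3],
    "amount": [10.0, 25.0, 25.0, -999.0],  # -999 marks a failed import
})

before = len(payments)
cleaned = payments.drop_duplicates(subset="txn_id")
cleaned = cleaned[cleaned["amount"] >= 0]  # drop rows flagged as errors
after = len(cleaned)

# Document every adjustment so the analysis stays transparent
cleaning_log = f"Removed {before - after} of {before} rows (duplicates, error codes)"
```

Keeping a log line like `cleaning_log` alongside the cleaned dataset is a lightweight form of the documentation the text recommends.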
In summary, proactive management of data limitations (anticipating issues of incompleteness, structure, and corruption) maximizes the data's value and enables accurate conclusions.
Remark 1. Data is valuable but has its limitations. As a data analyst, it is essential to recognize these limits and implement practices and tools to manage them. By following these guidelines, you will be better prepared to work with quality data and make informed decisions based on solid analyses.
3.2 Positioning the Data Analyst within the Organization
The role of a data analyst is fundamental within an organization, not only for producing accurate analyses but also for establishing clear and smooth communication with teams. Finding the best solutions begins with asking the right questions, and the data analyst must constantly seek balance within the team to facilitate the transfer of information and data.
To successfully carry out their missions, the analyst must adapt their communication to their audience. Four questions guide this effort: Who constitutes my audience? What are their prerequisites and level of understanding? What information is truly necessary? How can I communicate clearly and effectively? By asking these questions, the data analyst refines their responses to be understood and to effectively meet the needs of their interlocutors. This approach helps establish a climate of trust, improve communication within the organization, and tailor analyses to the expectations of partners.
To properly guide their work, the data analyst must be transparent and realistic about the capabilities and limitations of the available data. This means clearly framing the objectives and expectations of projects and reformulating requests if necessary to ensure relevance. By managing these expectations, the analyst helps partners understand what the data can provide while remaining clear about any constraints. Responsiveness also plays a key role in communication quality. When faced with a question or request, the analyst should respond without delay, thank the audience for their patience, and indicate a return timeframe to show commitment to providing precise answers.
As they gain experience, the data analyst's responsibilities evolve. Their skills allow them to take on increasingly strategic projects, strengthening their impact within the company by guiding data-driven decisions. Depending on the organization's structure, a data analyst may be an internal team member or an external consultant. In both cases, they strive to collaborate closely with teams to understand their specific needs and offer tailored solutions.
Effective project management is essential to achieve efficient and high-quality results. Here are some recommended rules for successfully managing a project:
— Discuss objectives: Inquire about the type of results the partner expects. Clearly understanding the project's objectives is essential for guiding your analysis effectively.
— Feel authorized to say "no": Have the confidence to say "no" when justified, and provide a respectful explanation. It is important to maintain the integrity of your work and not compromise the quality of your analysis.
— Plan for the unexpected: Before starting a project, list potential obstacles that could arise along the way. Proactive planning can help you anticipate and overcome challenges more effectively.
— Know your project: Understand how your project fits into the rest of the company and involve yourself to provide as much information as possible. A deep understanding of the project's context is crucial for making informed decisions.
— Start with words and visuals: Begin with a quick description and visual of what you are trying to convey. Visual communication can simplify understanding of analysis results.
— Communicate often: Share notes on milestones, difficulties, and project changes with stakeholders. Use these notes to produce a clear and informative report to communicate.
Balancing Needs and Expectations
As a data analyst, one of your essential missions is to focus on
your partners’ expectations. These individuals, having invested
time, resources, and interest in the projects you work on, often
rely on your analyses to meet their own needs and objectives. It
is therefore crucial that your work aligns with their expectations
and that you maintain clear and effective communication with
all team stakeholders.
This regular communication helps build trust in your work
and establish fruitful collaborative relationships. Your partners
will seek to understand the project’s overall goal, the resources
needed to achieve it, and the potential challenges you might
encounter. These exchanges are a valuable opportunity to clarify
expectations, discuss issues, and ensure everyone understands
the role of analysis in the project’s success.
Project managers, for example, are responsible for planning and executing the project as a whole. They monitor team progress and ensure the project stays on track. You will need to provide them with regular updates, identify the resources you need, and promptly report any issues that may arise. You may also collaborate with other team members, such as HR administrators, who will need to know your analysis criteria to better organize data collection on employees. Similarly, you might work alongside other data analysts, each with specific expertise.
In summary, knowing your partners and team members well
allows you to tailor your communication and provide them with
the necessary information to advance in their roles. You all work
together to provide the organization with vital insights that will
guide decisions and shape the company’s strategy.
Collaboration with the Organization’s Strategic Teams
The role of a data analyst involves collaborating with various teams within the organization, each with specific needs and a different approach to data. To work effectively, it is crucial to understand each team's expectations and to adapt communication and deliverables to their goals. Key partner teams include management, the customer-facing team, and the data science team. Each group has distinct responsibilities and unique perspectives that influence how they will use the analyses provided.
The management team, composed of senior executives such as vice presidents and directors, focuses on the company's strategic leadership. These leaders set organizational goals, develop overarching strategies, and oversee their implementation. Given their responsibilities and limited time, they prefer concise analyses highlighting key conclusions and concrete implications for the company's strategy. It is therefore important to present clear and directly actionable results, answering their specific questions without unnecessary technical detail. As Edward Tufte, a pioneer in data visualization, emphasizes, "Above all else, show the data"; information must be presented clearly and concisely, without information overload.
The customer-facing team includes the organization's employees who regularly interact with customers and gather their feedback. Their role is to understand customer needs, define expectations, and relay feedback to internal teams. Often, this team may express specific requests or hypotheses based on their field observations. As an analyst, it is essential to approach these requests objectively, allowing the data to reveal independent and reliable insights without seeking to confirm unfounded expectations. Tom Davenport, in *Competing on Analytics*, asserts that "competing on analytics means out-thinking rather than out-muscling the competition." This implies that the analyst must provide data and analyses that address concrete questions, avoiding the validation of preconceived hypotheses.
Lastly, the data science team plays a key role in advanced data analysis and modeling to support predictive decisions and strategic insights. Collaborating with this team allows various kinds of technical expertise to be combined. For example, you might analyze employee productivity data while another analyst examines recruitment data; these results would then be used by a data scientist to model the impact of new practices on employee engagement. This close collaboration enables a comprehensive view and the exploration of new analytical angles, bringing greater depth to the insights produced. As Stephen Few, a data visualization expert, notes, "Dashboards are a means to an end, not the end itself." This highlights the importance of data presentation and of extracting relevant conclusions, rather than focusing on detailed reports that do not directly serve decision-making.
In this collaborative dynamic, it is crucial that the data analyst also serve as a translator between technical and non-technical teams. As McKinsey & Company points out in its report on the role of the *Data Translator*, this professional must understand business needs and translate data into actionable information, thus facilitating strategic decision-making. The ability to understand the expectations of different groups and to tailor analyses accordingly is essential for the success of data analysis projects within the organization. Finally, Howard Dresner, a business intelligence consultant, states that "business intelligence is not about tools or technology, but about transforming data into useful information." This principle reflects the importance of strategic data management in addressing business challenges, rather than a focus solely on the tools used.
Thus, collaboration with these teams relies on targeted communication and the ability to adapt analyses to the specific needs of each group. By understanding and meeting each group's expectations, you contribute effectively to the organization's strategic goals and support decision-making at all levels.
Figure 3.1 – Collaboration within the data team

3.2.1 Adherence to Best Practices

Best practices are recognized standards, methods, or approaches considered effective and appropriate in a particular field. In the context of data analysis, following best practices is of great importance, as it ensures a more efficient, accurate, and reliable data analysis process. Here are some advantages of best practices:
— Increased consistency and reliability of results: By following best practices, data analysts maintain a consistent approach throughout their work. This means that the results of an analysis are reliable and reproducible, which is essential for establishing the credibility of the analysis.
— Reduced errors: Best practices often include steps for verifying and validating data, which helps reduce errors in the analysis. This ensures that decisions based on the results are made with accurate data.
— Better time management: Best practices generally include efficient processes that save time. By following well-established methodologies, data analysts can avoid wasting time searching for errors or repeating steps.
— Greater clarity and understanding: Best practices encourage clear documentation and communication. This means that the results of the analysis are easier to understand for stakeholders, which is essential for informed decision-making.
— Greater adaptability: While best practices provide guidelines, they are not rigid. They can be adapted according to the specific needs of each project. This allows for flexibility while maintaining high-quality standards.
— Better data security: Best practices often include security measures to protect sensitive data. This ensures that company and customer data is handled securely and in compliance with regulations.
— Professional development: Adhering to best practices is a valuable skill for data analysts. It can help them advance in their careers, as employers seek professionals capable of producing quality analyses.
— Increased credibility: By following best practices, data analysts enhance their credibility with colleagues and superiors. This can lead to more significant work opportunities and leadership roles within the organization.
Communication can be defined as a process of transmitting information, ideas, thoughts, or meanings between a sender and a receiver, through signs, symbols, or languages. According to Shannon and Weaver (1948), this process includes sending, receiving, and interpreting the message while accounting for potential distortions that may disrupt understanding. This definition highlights the importance of efficiency in the transfer of information and emphasizes that communication is not limited to the emission of a message but extends to mutual understanding between the involved parties.
The consequences of poor communication can be dramatic in a professional environment and sometimes even fatal to a project. For instance, a construction company misinterpreted technical specifications due to poor communication between the design team and the on-site team. This error caused significant delays and additional costs and damaged the relationship with the client. Another example shows a company missing a major strategic opportunity simply by failing to communicate changes in the project schedule clearly. Research by the Project Management Institute identifies poor communication as a leading contributor to project failure. These incidents illustrate how poor communication management can jeopardize a project and damage a company's reputation. As Peter Drucker said, "Poor management of communication can cost much more than poor financial management."
Before preparing a presentation or sending an email, it is essential to understand your audience: what they already know, what they need to learn, and how you can effectively convey your message. Aristotle, in his Rhetoric, stated that "the first principle of effective communication is to know your audience." Such preparation has a direct impact on the quality of your communication and helps build a relationship of trust with your audience. By taking their expectations and level of knowledge into account, you lay the foundation for solid and successful communication. As Stephen Covey wrote: "Seek first to understand, then to be understood."
Emails, while formal, do not necessarily need to be long. It is
crucial to write clearly and concisely, using complete sentences,
proper spelling, and appropriate punctuation. Messages should
focus on the essentials to avoid information overload. A study
by the University of California shows that short and targeted
messages are better received and generate higher engagement
than long messages. Conciseness is key to ensuring your emails
are not overlooked among other priorities. Blaise Pascal, in his
Provincial Letters, humorously noted : "I made this letter longer
only because I didn’t have the time to make it shorter."
Some situations require less pleasant communication, such as announcing problems or delays. However, these exchanges are not only acceptable but also necessary for the progress of the project. As Carl Jung said, "What we resist persists." Ignoring or avoiding difficulties only exacerbates tensions. By clearly communicating the issues, offering solutions, and clarifying expectations, you can unlock situations that would otherwise remain stagnant. Even though the fastest response is not always the most precise, it is possible to find a balance by understanding others' needs and setting clear expectations from the start.
In summary, effective communication starts with a clear understanding of the partners' expectations. This allows you to set realistic goals with specific expectations, clear timelines, and agreed-upon follow-up reports. As George Bernard Shaw said, "The single biggest problem in communication is the illusion that it has taken place." Once these foundations are established, you ensure that your message is not only heard but also understood and taken into account.
3.2.2 Meetings
Meetings are a fundamental pillar of communication within teams and with partners. They provide a platform to discuss project progress, align objectives, and ensure coordination of efforts. As Peter Drucker, the famous management theorist, emphasized: "Communication is the lubricant that keeps the company running, but poorly managed meetings are the sand that grinds it and slows its movement." It is therefore essential to structure each meeting properly to maximize its effectiveness.
A study reported by the Harvard Business Review found that managers spend an average of 23 hours per week in meetings, but more than 67% of these meetings are deemed unnecessary or poorly prepared. This shows how poor meeting management can negatively impact a company's productivity. Preparation is therefore crucial to avoid wasting attention and to reduce inefficiency.
When leading a meeting, preparation is essential. It is important to know precisely what you are going to address, to limit the number of participants to facilitate discussion, and to respect the schedule to maintain a productive dynamic. Another key aspect of meeting management is the clarity of the agenda. According to Socrates, "Real knowledge is knowing that you know nothing." This maxim applies perfectly to meetings: a good facilitator knows what they want to achieve from the meeting but remains open to diverse viewpoints and suggestions, allowing for better collective decision-making.
Additionally, always have note-taking tools on hand, such as a notebook or laptop. "Words that are not written are soon forgotten," as the writer Antoine de Saint-Exupéry observed. Active note-taking keeps a record of decisions, important ideas, and points to revisit, strengthening participants' buy-in to the decisions made.
During the meeting, it is crucial not to monopolize the conversation, to avoid unnecessary interruptions, and to remain focused on the discussion's goal. Encourage participation from all team members by asking questions, seeking clarifications, and requesting feedback on the points raised. "A leader who does not listen will not lead their team effectively," said John C. Maxwell, an expert in leadership. It is essential that each team member feels heard and valued, which promotes creativity and engagement.
Every meeting should result in a clear decision, made by the appropriate person. This recalls Aristotle's famous principle: "Man is by nature a social animal." As social beings, humans are often more productive when their interactions are clear and well defined. If urgent decisions need to be made, a dedicated meeting should be scheduled immediately. Sometimes, poor communication can lead to catastrophic consequences. A common anecdote in the business world involves a large technology company missing a major strategic opportunity due to a poorly structured meeting and unclear communication about priorities. Such mistakes can be costly, highlighting the importance of rigorous meeting management.
As a meeting facilitator, follow these steps to ensure its
effectiveness:
Before the meeting
— Identify the objective of the meeting.
— Clearly define the purpose and context.
— Establish expected outcomes and decisions to be made.
— Prepare questions to be addressed and points to clarify.
— Organize data to be presented and create visualizations
when relevant.
— Prepare and distribute an agenda including the start
time, expected duration, location, and topics to be dis-
cussed.
During the meeting
— Introduce participants and their roles.
— Lead the discussion in a structured manner, guiding the
analysis of data.
— Present the results of the analyses clearly and visually.
— Discuss the implications of the results, their interpreta-
tion, and their impact on the project.
— Take notes to record key points and decisions made.
— Summarize next steps and actions to be taken.
After the meeting
— Write and distribute a concise summary of the meeting,
emphasizing next steps and decisions made.
— Distribute relevant notes or data to the team.
— Confirm next steps and establish an action schedule.
— Solicit feedback to ensure nothing has been overlooked
and that expectations are aligned.
In summary, effective communication is essential to keep
partners informed about the project’s progress, while antici-
pating issues before they become obstacles. Whether through
email, meetings, or reports, it is crucial that every stakeholder
has a clear view of the progress and upcoming actions. Proactive
communication management helps better coordinate efforts, ma-
nage resources more efficiently, and ultimately ensure the success
of the project.
3.3 Tools of the Data Analyst
The choice of tools to use depends on the nature of the pro-
ject, the data analyst’s skills, and the type of data to be analy-
zed. It is essential to determine whether the data is quantitative
or qualitative. Quantitative data, often in the form of numbers
stored in spreadsheets and databases, can be easily analyzed
using tools such as Excel. On the other hand, analyzing qualita-
tive data, such as responses to open-ended surveys, emails, and
social media conversations, often requires more advanced data
analysis software.
There are many data analysis tools available on the market,
whether open-source or commercial software. Here are some of
the tools commonly used for data analysis :
Excel : The most well-known spreadsheet tool that allows
analyzing and manipulating data using formulas and
functions.
SQL : The Structured Query Language used to manage and
manipulate data stored in relational databases.
R : An open-source programming language used for statis-
tical computing and graphics, with useful packages such
as dplyr, ggplot2, tidyr, and caret.
Python : A versatile programming language used for data
analysis, machine learning, and web development, with
powerful libraries such as Pandas, NumPy, Scikit-learn,
TensorFlow, and PyTorch.
Tableau : A data visualization tool for creating interactive
visualizations and dashboards.
Looker : A tool appreciated by data analysts for creating
visualizations based on query results and presenting data
comprehensively.
Power BI : A Microsoft business analytics service for ana-
lyzing data and sharing insights.
RapidMiner : A data science platform for developing pre-
dictive machine learning models from data.
SAS : A statistical software suite used for data manage-
ment, analysis, and reporting.
SPSS : A statistical software for data analysis, data explo-
ration, and decision-making.
Spark : A big data processing engine for distributed com-
puting and large-scale data processing.
3.4 Spreadsheets
A spreadsheet is an essential digital tool for storing, organi-
zing, and sorting data. Proper data structuring is crucial to make
the most of it. Spreadsheets make it easy to identify patterns,
group information, and find the data you need. They also offer
valuable features such as formulas and functions.
3.4.1 Formulas
Formulas are one of the most valuable features of spread-
sheets. A formula is a set of instructions that performs a speci-
fic calculation using the data in the spreadsheet. Formulas allow
data analysts to perform powerful calculations automatically,
simplifying data analysis. Here are some tips to make the most
of formulas :
— All formulas start with an equal sign (for example,
=A2+A3).
— Use parentheses to define the order of operations.
— Use predefined functions to perform common calcula-
tions.
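The mechanics described above — a formula that recomputes a result from cell values, with parentheses controlling the order of operations — can be sketched in Python. The mini-spreadsheet below is purely illustrative (the cell names and helper are invented), not any real product's API:

```python
# Minimal sketch of how spreadsheet formulas work: a cell holds a
# value, and a formula recomputes from other cells' values.
# Purely illustrative; cell names and the helper are hypothetical.

cells = {"A2": 10, "A3": 32}

def evaluate(formula, cells):
    """Evaluate a formula such as '=A2+A3' against the cell values."""
    if not formula.startswith("="):          # every formula starts with '='
        raise ValueError("formulas must start with '='")
    expr = formula[1:]
    for name, value in cells.items():        # substitute cell references
        expr = expr.replace(name, str(value))
    # eval honors parentheses, i.e. the order of operations
    return eval(expr)

print(evaluate("=A2+A3", cells))        # 42
print(evaluate("=(A2+A3)*2", cells))    # 84
```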
3.4.2 Error Detection Tips
Conditional formatting is a useful feature for spotting errors
in a spreadsheet. You can highlight all cells containing errors in
yellow, making it easier to correct them. Here are some tips to
avoid common errors in spreadsheets :
— Use data filtering to make the spreadsheet less complex
and more organized.
— Use frozen headers to maintain readability while scrolling.
— Use the asterisk (*) for multiplication, not the letter "X".
— Start every formula or function with an equal sign (=).
— Ensure that every open parenthesis has a corresponding
closing parenthesis.
— Customize formatting to make it easier to read.
— Create separate tabs for raw data and for cleaned data.
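Several of these checks can be automated. A minimal sketch, assuming a hypothetical helper that flags three of the mechanical errors above (a missing equal sign, unbalanced parentheses, and "X" used instead of "*"):

```python
# Illustrative checker for common formula errors; the function
# name and the crude 'X' heuristic are this sketch's inventions.

def formula_errors(formula):
    errors = []
    if not formula.startswith("="):
        errors.append("missing leading '='")
    depth = 0
    for ch in formula:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:                 # ')' with no matching '('
                errors.append("unmatched ')'")
                break
    if depth > 0:
        errors.append("unclosed '('")
    if "X" in formula.upper():            # crude heuristic
        errors.append("use '*' for multiplication, not 'X'")
    return errors

print(formula_errors("=(A2+A3)*2"))   # []
print(formula_errors("A2 X A3"))      # flags the missing '=' and the 'X'
```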
3.5 Structured Query Language (SQL)
Structured Query Language (SQL) is a computer program-
ming language that allows for the retrieval and manipulation of
data stored in a database. It plays a crucial role in communi-
cation between data analysts and databases. SQL is the most
widely used query language due to its simplicity and compati-
bility with various types of databases.
The query language allows data analysts to extract specific
information from one or more databases, making it easier to
explore and understand the data. It also enables selecting,
creating, inserting, or updating data for analysis purposes.
Data analysts often use a combination of spreadsheets and
SQL to manage and analyze their data, leveraging the power of
both formats.
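As a sketch of how this looks in practice, the example below uses Python's built-in sqlite3 module and an invented sales table; the query itself is standard SQL:

```python
import sqlite3

# In-memory database with a small, made-up sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("bike", "north", 120.0), ("bike", "south", 80.0),
     ("helmet", "north", 30.0)],
)

# A typical analyst query: total revenue per product, sorted.
rows = conn.execute(
    """SELECT product, SUM(amount) AS total
       FROM sales
       GROUP BY product
       ORDER BY total DESC"""
).fetchall()

for product, total in rows:
    print(product, total)   # bike 200.0, then helmet 30.0
conn.close()
```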
Figure 3.2 – Caption
3.6 R and Python for Data Analysts
R and Python are two powerful and versatile programming
languages widely used in the field of data analysis. They offer ex-
tensive functionalities for data manipulation, visualization, and
analysis. Data analysts frequently use them to solve complex
problems and gain valuable insights from data.
3.6.1 R
R is an open-source programming language specifically de-
signed for statistical analysis and data visualization. It offers a
wide range of packages and libraries specifically developed for
data analysis tasks. Here are some reasons why R is popular
among data analysts :
— Richness of packages : R has thousands of packages cove-
ring various domains, from statistics to visualization to
machine learning. Packages like dplyr, ggplot2, tidyr, and
caret are widely used for data manipulation and graph
creation.
— Advanced visualization : R offers advanced visualization
features, enabling the creation of customized and infor-
mative graphs. The ggplot2 library is especially appreciated
for creating elegant graphics.
— Robust statistics : R was designed by statisticians, making
it an ideal choice for complex statistical analyses. It pro-
vides advanced tools for statistical modeling and infe-
rence.
— Easy data import : R allows for easy importing of data
from various formats such as CSV, Excel, SQL, and more.
3.6.2 Python
Python is a versatile programming language widely used in
data science. It offers great flexibility for data analysis, machine
learning, and application development. Here’s why many data
analysts choose Python :
— Powerful libraries : Python has popular libraries like
Pandas, NumPy, Scikit-learn, TensorFlow, and PyTorch,
which facilitate data manipulation, analysis, and machine
learning model creation.
— Versatility : In addition to data analysis, Python is used
in a variety of fields, from web programming to task au-
tomation, making it a versatile language for IT professio-
nals.
— Large community : Python benefits from an active com-
munity and abundant online resources. There are many
tutorials, forums, and open-source libraries available to
assist in your work.
— Integration with other tools : Python can be easily inte-
grated with other tools and systems, making it suitable
for diverse professional environments.
Whether you choose R or Python, both languages offer po-
werful capabilities for data analysis. The choice between the
two often depends on personal preferences, existing skills, and
the specific requirements of your project. Many data analysts
even choose to master both languages to expand their skills and
versatility.
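To give a flavor of analysis in Python, here is a small group-and-average computation using only the standard library; with Pandas, the same operation is typically a one-liner along the lines of df.groupby("city")["duration"].mean(). The ride records are made up:

```python
from collections import defaultdict
from statistics import mean

# Made-up bike-ride records: (city, duration in minutes).
rides = [("paris", 12), ("paris", 20), ("lyon", 9), ("lyon", 15), ("lyon", 6)]

# Group durations by city, then average each group.
by_city = defaultdict(list)
for city, duration in rides:
    by_city[city].append(duration)

avg_duration = {city: mean(values) for city, values in by_city.items()}
print(avg_duration)   # average minutes per city
```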
Remark 2. Spark is a powerful tool for processing large-scale
data, often more associated with big data and data engineering
than data analysis. However, it can also be used by data ana-
lysts when working on massive datasets that require significant
computational power.
3.11 Example of Project Workflow
3.12 Project Team Meeting
In this strategic meeting, Mike, the sales manager, brought
the project team together to discuss the data analysis regar-
ding the profitability of bicycles. The goal was to clearly define
the roles and responsibilities of each member and ensure good
communication between stakeholders. Here is an overview of the
team members and the meeting content.
3.12.1 Participants
— Mike : Sales Manager, responsible for managing the pro-
ject.
— Kareem : Data Analyst, responsible for data analysis.
— Lee : Data Engineer, responsible for technical data ma-
nagement.
— Sandra : Financial Manager, responsible for evaluating
costs and financial optimization.
— Natacha : Communication Manager, responsible for in-
forming the hierarchy and managing external communi-
cations.
3.12.2 Meeting Flow
The meeting begins with Mike presenting the project’s ob-
jectives to the team :
"We need to analyze the profitability of our bike ren-
tal service in the city, understand the causes of decli-
ning profits, and make recommendations. To do that,
we must first understand the data before making any
recommendations."
3.12.3 Roles and Responsibilities
Mike emphasizes the need for clear communication and not
rushing actions before fully understanding the data analysis re-
sults. He reminds the team that the goal is to analyze the data
deeply, without bias, and draw objective conclusions.
Kareem’s Role
Kareem, as the data analyst, explains that he will focus on
exploratory data analysis. He mentions that the data is scattered
across different platforms and formats, which will require a data
cleaning and normalization phase. This phase, led by Lee, will be
crucial to ensure the data is compatible and ready for analysis.
Lee’s Role
Lee, the data engineer, presents his role : "I will work on
integrating data from different systems and cleaning it to make
it compatible with analysis. This step should take a few days,
depending on the volume and complexity of the data."
Sandra’s Role
Sandra, the financial manager, raises the issue of costs re-
lated to the project. She explains that it’s essential to have an
accurate estimate of expenses to ensure the project stays within
budget. She plans to closely monitor the costs associated with
the tools required for data cleaning and analysis. Sandra will
start by drafting a preliminary budget but will wait to finalize
the estimate once the exact needs are clearer.
Natacha’s Role
Natacha, the communication manager, emphasizes the im-
portance of regular communication with the hierarchy. She pro-
poses writing a summary note for management to keep them
informed of the project’s progress and objectives. "It’s crucial
to keep the hierarchy updated without delving into too many
technical details. Regular project progress updates are key to
avoiding ambiguity," says Natacha.
3.12.4 Meeting Conclusion
Mike concludes the meeting by emphasizing the importance
of collaboration among team members. He reminds everyone
that the initial goal is to extract conclusions from the data, not
validate preconceived hypotheses. "We will track the progress
of the analysis with regular reports. If adjustments are needed
along the way, we will make them," he concludes.
The meeting clarified roles, defined an action plan for the
analysis phase, and ensured the team was ready to work together
to meet the project’s challenges.
Chapter 4
Data Collection:
Methods, Tools, and Best
Practices
4.1 Choosing Data
Data, the raw units of recorded information,
is generated in massive quantities around the world, every mi-
nute of every day [Provost and Fawcett, 2013]. Millions of SMS
messages are exchanged, hundreds of millions of emails are sent,
millions of online queries are made, and millions of videos are
viewed. These numbers continue to grow, creating a vast "ocean
of data" [Lyman and Varian, 2000]. According to Floridi, data
represents a "constitutive element of knowledge in the modern
world" [Floridi, 2011].
In the field of data analysis, a population refers to the entire
set of potential data values in a given dataset. However, collec-
ting data from the entire population can be extremely complex.
This is where the concept of sampling comes in, which, as Hand
says, "is the key to making the impossible possible : it allows us
to make reliable inferences about an entire population without
ever interacting with each of its members" [Hand, 2008].
A sample is a representative portion of the total population.
The way you choose to construct your sample will depend on
the nature of your project and your goals. Taking into account
ethics, relevance, and accuracy, each project requires a carefully
calibrated approach to data selection [Serres, 2008]. Sampling
has many advantages. For example, it is widely used in various
fields : opinion polls capture the preferences of a population on a
political issue ; clinical trials test a medical treatment on a group
of representative patients ; market studies analyze a segment of
buyers to anticipate consumer behavior. In each of these cases,
the sample allows the study’s objectives to be met without the
need to analyze the entire population. To ensure the quality of
the sample, different sampling methods are applied. Random
sampling gives each member of the population the same pro-
bability of being selected, minimizing bias. Moreover, stratified
sampling divides the population into homogeneous subgroups,
or strata, to ensure each segment is properly represented. Sys-
tematic sampling provides a simple approach by selecting every
n-th individual after a random first choice, ensuring a consistent
distribution of data.
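The three methods just described can be sketched with Python's random module; the population of 100 members, the two strata, and the sample sizes are all made up for illustration:

```python
import random

random.seed(0)  # reproducible illustration
population = list(range(1, 101))  # a made-up population of 100 members

# Random sampling: every member has the same chance of selection.
simple = random.sample(population, 10)

# Systematic sampling: every n-th member after a random first choice.
step = len(population) // 10
start = random.randrange(step)
systematic = population[start::step]

# Stratified sampling: sample each homogeneous subgroup (stratum).
strata = {"low": population[:50], "high": population[50:]}
stratified = [member
              for stratum in strata.values()
              for member in random.sample(stratum, 5)]

print(len(simple), len(systematic), len(stratified))  # 10 10 10
```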
4.1.1 Types of Data
Internal Data : These are data generated and stored wi-
thin a company or organization. These data often come
from internal systems such as databases, log files, or cus-
tomer relationship management (CRM) systems.
External Data : External data, on the other hand, come
from sources outside the company or organization. This
may include data from public sources, business partners,
social media, IoT sensors, or any other type of accessible
data.
Continuous Data : Continuous data are measured data
that can take almost any numerical value. For example,
temperature, stock prices, or wind speed are examples of
continuous data.
Discrete Data : Unlike continuous data, discrete data are
counted and can only take a limited number of distinct
values. For example, the number of website visitors, the
number of products sold, or the number of phone calls
received are discrete data.
Qualitative Data : Qualitative data are subjective measu-
rements that describe qualities or characteristics. They
are not expressed in numerical terms. For example, the
color of a product, customer satisfaction expressed in
words, or the product category are qualitative data.
Quantitative Data : In contrast, quantitative data are
specific and objective measurements of numerical facts.
They are expressed in numerical terms. For example, the
revenue of a company, the size of a population sample, or
the number of defective products are quantitative data.
Nominal Data : Nominal data are a type of qualitative
data that are not classified according to a defined order.
For example, product colors (red, blue, green) are nomi-
nal data, as there is no intrinsic order between them.
Ordinal Data : Ordinal data are also qualitative data, but
they are classified according to a defined order or scale.
For example, a product rating on a satisfaction scale
(poor, average, good) is ordinal data, as there is an
inherent quality order.
Structured Data : Structured data are data organized in
a specific format, typically in rows and columns in a da-
tabase or table. They are easy to query and analyze due
to their organization.
Unstructured Data : In contrast, unstructured data do
not follow a clear and organized format. They may in-
clude free-text entries, images, videos, audio recordings,
social media posts, etc. Analyzing these data can be more
complex due to their unstructured nature.
Binary Data : Data that can only be expressed as "yes" or
"no", "true" or "false", with no other nuance.
4.1.2 Collection of Secondary and Third-Party Data
If you are not collecting data internally, you can acquire data
from secondary or third-party providers.
Secondary data are collected by separate entities before being
resold, while third-party data are offered by providers who did
not collect them initially. These third-party data may come from
various sources [Greenwade, 1993].
4.1.3 Ethical Considerations in Data Collection
When considering collecting your own data, it is imperative
to do so with absolute respect for ethics, preserving the rights
and privacy of the individuals involved. It is also important to
make wise decisions regarding the sample size. A random sample
from existing data may be suitable for certain projects, while
others may require a more strategic data collection focused on
specific criteria. Each project is unique, and the choice of data
should effectively serve to solve the problem at hand.
Remark 4. When collecting data, consider the time required,
especially if you need to track trends over a long period. If you
need a quick answer, it is possible that you may not have the
time to collect new data. In this case, using historical data that
is already available may be the most appropriate solution.
4.2 Data Reliability
Data reliability is fundamental for making informed decisions
and drawing accurate conclusions. Reliable data forms the foun-
dation of solid analysis and ensures the validity of results. Here
are some essential best practices to ensure this reliability :
Source Identification : Select reliable and reputable
sources, such as academic publications, official databases,
or reports from recognized organizations in the field. The
authenticity and credibility of the source are the first
steps to ensure the quality of the data used.
Bias Assessment : It is essential to examine sources for
potential biases. Look for data from impartial sources or
diverse perspectives to ensure balanced analysis. A biased
source could influence the study’s conclusions, hence the
importance of critically evaluating the origin of the data.
Sample Size and Representativeness : Ensure that the
sample selected for studies or surveys is large enough and
representative of the population being studied. An in-
adequate sample can introduce bias and compromise the
statistical validity of the results.
Quality and Validation of Tools : The accuracy of data
depends in part on the collection and analysis tools.
Choose validated tools commonly used in the field and
ensure they are regularly updated and reliable. Reliable
software reduces errors and strengthens data accuracy.
Data Verification and Triangulation : Compare the
data with multiple sources or cross-check them with other
datasets to verify their consistency. This triangulation
helps strengthen data reliability and identify any anoma-
lies.
Data Integrity and Management : Once data is collec-
ted, store it securely to prevent any risk of manipula-
tion or alteration. Rigorous and transparent data ma-
nagement ensures its integrity throughout the analysis
process.
In summary, data reliability relies on a rigorous approach
at every stage : from collection to analysis. By applying these
principles, you ensure that the data is robust and the conclusions
drawn are well-founded.
4.3 Data Ethics and Confidentiality
Ethics and data confidentiality are crucial aspects to consi-
der during the collection, analysis, and use of data. The data
collected may contain sensitive information about individuals.
It is essential to adhere to ethical standards and privacy laws to
protect individuals’ privacy and ensure data integrity.
4.3.1 Data Ethics
Data ethics refers to the application of moral standards and
ethical principles in the collection, processing, and use of data.
It is based on principles such as transparency, accountability,
and respect for individual rights. Here are some key concepts to
consider :
Consent : When collecting data, it is essential to obtain in-
formed consent from participants. They must understand
the study’s objectives, the implications of their participa-
tion, and the intended use of their data. Consent must be
voluntary, informed, and revocable.
Anonymity and Confidentiality : Participants’ personal
information must remain protected during the data col-
lection and analysis process. Data anonymization, which
involves removing or masking personal information, is of-
ten necessary.
Data Security : The collected data must be stored securely
to prevent unauthorized access and to safeguard confiden-
tiality.
Minimization of Harm : Measures must be taken to limit
risks or negative consequences for participants. Data col-
lection procedures should be non-invasive, and support
should be available when needed.
Transparency : Participants must be transparently infor-
med about the study’s objectives, the methods used for
data collection, and the final use of their data. This also
includes explaining how data is processed and the algo-
rithms used.
Data Ownership : Individuals retain ownership of their
data. They control its use, processing, and sharing.
Laws and Standards
Laws and standards related to data protection vary across
countries. For example, the General Data Protection Regulation
(GDPR) in the European Union imposes strict requirements for
consent, breach notifications, and individuals’ rights to their per-
sonal data.
4.3.2 Data Confidentiality
Data confidentiality encompasses aspects related to the ac-
cess, use, and collection of data, as well as individuals’ legal
rights to their personal data. Data confidentiality laws protect
individuals’ rights and privacy, as exemplified by the GDPR in
Europe.
Data Anonymization
Data anonymization is a crucial process designed to pro-
tect individuals’ privacy by making it impossible to iden-
tify them from the collected data. This process is par-
ticularly relevant in the context of personal data protec-
tion laws, such as the General Data Protection Regula-
tion (GDPR) in Europe, which requires that information en-
abling the identification of an individual be strictly controlled
[European Parliament and Council of the European Union, 2016].
Anonymization typically involves removing, hashing, or mas-
king sensitive personal information such as names, addresses, or
identification numbers. This process ensures that even if the data
is compromised, individuals remain protected against the risks of
re-identification [International Organization for Standardization, 2019]. For example, tech-
niques such as pseudonymization and encryption of unique
identifiers (e.g., social security numbers or phone numbers) are com-
monly used to anonymize datasets.
The benefit of anonymization also lies in the fact that it al-
lows organizations to share or analyze data without compromi-
sing individuals’ confidentiality, thereby fostering research and
innovation. However, it is important to note that anonymiza-
tion must be carefully implemented : poorly executed anony-
mization methods can allow for inadvertent re-identification of
individuals, especially if additional datasets are used to cross-
reference the information [Narayanan and Shmatikov, 2008]. By
adhering to robust anonymization standards and techniques, the
risks can be significantly reduced, ensuring personal information
is effectively protected in an era where data is omnipresent.
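A minimal sketch of pseudonymization using the standard hashlib and hmac modules: the direct identifier is replaced by a keyed hash, so records of the same person can still be linked without revealing who they are. The secret key and field names are invented, and the key must itself be stored outside the dataset:

```python
import hashlib
import hmac

SECRET_SALT = b"keep-this-out-of-the-dataset"  # hypothetical secret key

def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier with a keyed hash (pseudonym)."""
    return hmac.new(SECRET_SALT, identifier.encode(),
                    hashlib.sha256).hexdigest()[:16]

record = {"name": "Alice Martin", "phone": "0601020304", "rides": 42}

anonymized = {
    "user_id": pseudonymize(record["phone"]),  # stable pseudonym
    "rides": record["rides"],                  # non-identifying field kept
}
# 'name' and 'phone' are dropped entirely; the pseudonym lets us link
# records of the same person without exposing the raw identifier.
print(anonymized)
```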
4.4 Open Source Data
Open source data refers to data that is freely accessible to
the public and can be viewed, used, and redistributed without
legal or technical restrictions. It is often published under an
open license that permits free usage. Open source data includes
government statistics, scientific publications, and other publicly
available datasets.
Interoperability, or the ability of systems to share data
openly, is a key aspect of open data. These resources serve as a
valuable foundation for research, analysis, and product develop-
ment [Greenwade, 1993].
4.5 Metadata
Metadata is information about the data. It provides details
about the origin, structure, and use of the data. Metadata en-
hances the quality and reliability of data by ensuring its accu-
racy, relevance, and timeliness. It also helps data analysts inter-
pret the data’s content and identify the causes of problems.
4.5.1 Metadata Repository
A metadata repository is a database specifically designed to
store metadata. It describes the location, structure, and status
of metadata, thus facilitating the management of multiple data
sources. Metadata repositories ensure the consistency and relia-
bility of data used in analysis.
Metadata plays a crucial role in ensuring data quality and
compliance with ethical and confidentiality standards. It also
allows tracking of data access and ensures transparency in data
usage.
4.6 Data Integrity
Data integrity is a crucial concept that refers to the accuracy, consistency, and reliability of data throughout its lifecycle. It aims to ensure that data remains complete, accurate, and consistent from creation to deletion. Data integrity can be compromised in many different ways, and it is essential to implement measures to preserve it.
Data can be compromised during replication, transfer, or manipulation. Data manipulation processes aim to organize and structure data for easier analysis, but any error during this process can jeopardize data integrity. Furthermore, human errors, malware, hacking, or system failures can also compromise data integrity.
Various methods help ensure data integrity, including data validation and verification, access control, data encryption, backups and disaster recovery, and regular maintenance and monitoring of data systems. It is also essential to train staff in data management and protection to minimize risks related to data integrity.
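One common verification technique is a checksum: a digest computed from the data that changes if any byte of the data changes. The sketch below uses Python's standard hashlib library; the sample bytes are purely illustrative.

```python
import hashlib

def checksum(data: bytes) -> str:
    """Return a SHA-256 digest used to detect accidental corruption."""
    return hashlib.sha256(data).hexdigest()

original = b"ride_id,started_at\nA1001,2022-06-30"
stored_digest = checksum(original)

# Later, before using the data, recompute and compare.
received = b"ride_id,started_at\nA1001,2022-06-30"
assert checksum(received) == stored_digest  # data arrived intact

corrupted = b"ride_id,started_at\nA1001,2022-06-31"
print(checksum(corrupted) == stored_digest)  # False: corruption detected
```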
4.6.1 Laws and Standards
Several laws and standards regulate the management and
protection of data integrity. For example, the GDPR (General
Data Protection Regulation) of the European Union imposes
strict requirements for the protection of personal data, including
measures to ensure data integrity.
4.7 Data Cleaning
Data cleaning is a critical step in data analysis, aiming to
ensure that the data used is accurate, complete, and consistent.
Following best practices in data cleaning enables analysts to
obtain more reliable insights and make better decisions.
— Define Data Quality Criteria: Before proceeding with data cleaning, it is essential to define data quality criteria to ensure completeness, accuracy, and consistency.
— Identify Data Anomalies: It is important to identify any anomalies in the data that could affect analysis, such as missing data, outliers, duplicates, or inconsistencies.
— Clean the Data: Data cleaning involves removing duplicates, filling in missing data, and correcting errors or inconsistencies.
— Normalize Data Formats: Data normalization ensures consistency across different data sources.
— Validate Data Accuracy: Validate the accuracy of the data by comparing it with other sources or performing data profiling.
— Document Data Cleaning Steps: It is essential to document all data cleaning steps to ensure transparency and reproducibility of the process.
— Perform Regular Data Cleaning: Data cleaning should be done regularly to maintain the accuracy and consistency of the data over time.
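As an illustration, the anomaly-identification and cleaning steps above can be sketched with pandas; the column names and values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "station": ["North", "north ", None, "South", "South"],
    "rides":   [12, 12, 8, None, 5],
})

# Identify anomalies: count missing values per column.
missing = df.isna().sum()

# Clean: normalize text, fill missing values, drop duplicates.
df["station"] = df["station"].str.strip().str.title()
df["rides"] = df["rides"].fillna(df["rides"].median())
df = df.drop_duplicates()
```

Each step should also be logged, in line with the documentation recommendation above.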
4.8 Insufficient Data
A lack of data in the analysis field can pose a major challenge for analysts, limiting the accuracy and scope of their analyses. However, strategies exist to overcome this problem and obtain useful insights:
— Collect Additional Data: When available data is insufficient, analysts can consider collecting additional data specifically for their project.
— Use Sampling Techniques: Statistical sampling allows for selecting a representative subset of data to analyze, thus reducing the data volume while maintaining the quality of information.
— Data Augmentation: Data augmentation involves adding synthetic data generated by algorithms or simulation techniques to enrich the existing dataset.
— Data Interpolation: Interpolation allows for completing missing data based on the values of neighboring data, thus contributing to better data completeness.
— Collaborate with Other Organizations: Collaboration with other organizations can provide access to data not publicly available, thus expanding analysis possibilities and improving result accuracy.
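As a brief illustration of the interpolation strategy, pandas can fill a gap in a numeric series from its neighboring values; the series below is hypothetical.

```python
import pandas as pd

# Hourly ride counts with a gap; linear interpolation estimates the
# missing value from its neighbors.
rides = pd.Series([10, 14, None, 22], name="rides_per_hour")
filled = rides.interpolate()  # linear by default
print(filled.tolist())  # [10.0, 14.0, 18.0, 22.0]
```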
4.9 Corrupted Data
Dirty data refers to data that is incomplete, erroneous, or irrelevant to the problem to be solved. Minor errors can have significant long-term consequences. Dirty data can result from various causes, including:
Duplicate Data: Repeated records, often due to incorrect manual entries or data imports. This can lead to skewed analyses, inflated or inaccurate numbers, or confusion when retrieving data.
Outdated Data: Expired information that should be replaced with more recent and accurate data, for example after staff changes or system migrations. Using outdated data can lead to inaccurate decisions and analyses.
Incomplete Data: Data with missing critical fields, often caused by incorrect data collection or entry. This can lead to reduced productivity, inaccurate information, or the inability to provide essential services.
Incorrect Data: Complete but inaccurate data, typically due to human errors or false information. Using incorrect data can result in decisions based on wrong information and loss of revenue.
Inconsistent Data: Data using different formats to represent the same information, often due to incorrect storage or errors during data transfer. This can lead to contradictory data, causing confusion or preventing customer classification or segmentation.
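As a quick illustration, duplicate and incomplete records can be flagged with pandas before any correction is attempted; the table below is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "email": ["a@x.com", "a@x.com", None, "c@x.com"],
})

duplicates = df.duplicated().sum()     # fully repeated records
incomplete = df["email"].isna().sum()  # missing critical fields
print(duplicates, incomplete)  # 1 duplicate row, 1 incomplete row
```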
4.10 Example of a Bike Rental Project Workflow
In this project, data engineer Lee is responsible for extracting, transforming, and managing data from various sources, preparing it for analysis by Kareem, who will focus on the actual analysis.
4.10.1 Data Collection
Lee identifies the different available data sources, such as SQL databases, JSON files, IoT sensors, and text files extracted from cloud platforms. He retrieves the data in a structured manner to facilitate further processing.
When analyzing bike-sharing trip data, it is important to distinguish between data and metadata. The primary data includes information such as trip ID (ride_id), bike type (rideable_type), trip start and end times (started_at, ended_at), and trip geolocation with departure and arrival coordinates (start_lat, start_lng, end_lat, end_lng). This data is directly related to the events being studied, i.e., the trips made by bike-sharing users.
On the other hand, metadata refers to additional information that describes, contextualizes, or qualifies this data without being directly related to the events themselves. For example, if Lee adds a column indicating the data collection date or the name of the data source, these would be considered metadata. They help clarify the origin of the data, its validity, and how it was collected, but are not part of the analyzed content, such as trip details. In summary, while the data focuses on the object being studied (bike trips), metadata provides the essential contextual framework for interpreting this data.
4.10.2 Choosing Data Storage Format
Lee primarily selects CSV and Excel formats to ensure good
accessibility and data integrity. The files are stored in secure
cloud services such as AWS S3 or Google Cloud.
4.10.3 Data Extraction and Transformation
Lee performs data extraction and transformation:
— SQL: Extracting data via SQL queries and exporting it to CSV.
— JSON: Using Python (pandas, json) to transform JSON files into DataFrames, then into CSV or Excel.
— Text Files: Cleaning text data using Python and regular expressions.
— IoT: Collecting real-time data via IoT APIs, cleaning raw data, and structuring it for analysis.
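As an illustration of the text-cleaning step, a regular expression can normalize whitespace in a raw string; the example string is invented.

```python
import re

raw = "Station:  North-Ave   &  Clark\tSt.\n"

# Collapse runs of whitespace and strip leading/trailing blanks,
# a typical first pass when cleaning text extracted from files.
cleaned = re.sub(r"\s+", " ", raw).strip()
print(cleaned)  # "Station: North-Ave & Clark St."
```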
4.10.4 Data Cleaning and Preprocessing
Lee performs thorough data cleaning by removing duplicates,
handling missing values through imputation, and normalizing
data to ensure consistency in formats and units.
4.10.5 Data Storage and Access
Data is stored in secure cloud environments, with strict access permissions, ensuring confidentiality and effective collaboration within the project team.
4.10.6 Transition to Data Analysis
Once the data is ready and structured, Kareem takes over
to perform data analysis, identify trends, and generate reports
and visualizations from the data collected and cleaned by Lee.
4.10.7 Transforming JSON to CSV in Python
Lee retrieves JSON data and transforms it into CSV using Python. Here is an example code to accomplish this task:

import pandas as pd
import json

# Load the JSON file
with open('data.json', 'r') as file:
    data = json.load(file)

# Convert JSON to DataFrame
df = pd.json_normalize(data)

# Save the DataFrame to a CSV file
df.to_csv('data.csv', index=False)

In this example, the JSON file is loaded and converted into a DataFrame using the pd.json_normalize() function, then exported to CSV with df.to_csv().
Chapter 5
Data Cleaning: Preparation and Optimal Choices
5.1 Using a Sample
In data analysis, the decision to use a sample can be made at different stages of the process: during data collection, cleaning, or analysis. However, it is often advantageous to decide to use a sample from the outset of data collection, especially if the data size is too large to process efficiently or exceeds the capabilities of available hardware.
5.1.1 Why Use a Sample?
A sample is a representative subset of a population that allows for extracting relevant information while saving time and resources. This method is commonly used in statistical studies as it enables drawing conclusions about an entire population from a small number of observations. It helps guide research and make informed decisions.
5.1.2 The Trade-offs of Sampling
While sampling is a powerful tool, it has certain trade-offs. Indeed, using a sample means that we cannot guarantee that the results obtained will be exactly the same as if the entire population were analyzed. This can lead to sampling bias, where some elements of the population are over-represented or under-represented in the sample. To minimize this risk, it is strongly recommended to use a random sampling method, which gives every element of the population an equal chance of being selected, thus reducing potential biases [Pierson, 2015].
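A minimal sketch of simple random sampling with Python's standard library; the population here is just a list of hypothetical record IDs.

```python
import random

population = list(range(1, 1001))  # 1,000 hypothetical trip records

# Simple random sampling: every record has the same chance of being
# selected, which limits sampling bias.
random.seed(42)  # fixed seed for reproducibility
sample = random.sample(population, k=50)
print(len(sample), len(set(sample)))  # 50 distinct records
```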
5.1.3 How to Determine the Sample Size?
The sample size plays a crucial role in the precision and reliability of results. Several factors need to be considered when making this decision:
— Avoid small sample sizes: It is generally recommended to use samples larger than 30, in line with the Central Limit Theorem (CLT), to ensure the results are representative of the population.
— Confidence level: A 95% confidence level is often chosen, but it can be adjusted based on the specific needs of the study.
— Margin of error: A larger sample size helps reduce the margin of error and increase the precision of results.
— Costs and benefits: It is important to assess the costs associated with increasing the sample size relative to the benefits of increased precision.
5.1.4 Precautions and Strategies
It is essential not to underestimate the importance of choosing an adequately sized sample. For example, sample sizes that are too small (less than 30) are not reliable for generating meaningful results due to data variability. Similarly, a sample size that is too large could incur unnecessary costs without substantially improving the accuracy of the conclusions.
5.2 Using a Sample
A sample represents a subset of the total population and must be selected in such a way as to faithfully reflect the characteristics of that population. The goal is to draw valid conclusions for the entire population from a reduced number of observations. This saves time, resources, and costs while still yielding meaningful insights.
5.2.1 Sampling Bias
One of the major drawbacks of sampling is sampling bias,
which occurs when the selected sample is not representative of
the entire population. This bias can manifest in various ways,
such as when a particular group in the population is under-
represented or over-represented in the sample. To minimize this
bias, it is crucial to use random sampling, where each individual
in the population has an equal chance of being selected.
5.2.2 Factors to Consider in Determining Sample Size
When determining the sample size, several factors need to be evaluated to ensure the accuracy of the results:
1. Sample size: A size smaller than 30 is generally considered insufficient for reliable statistical analysis.
2. Confidence level: A 95% confidence level is standard, but it can be adjusted based on specific needs.
3. Reducing the margin of error: The larger the sample, the smaller the margin of error, improving the accuracy of the results.
4. Cost of sample collection: It is essential to weigh the costs of a larger sample against the marginal increase in precision.
5.3 Sample Size Calculation
There are various methods for calculating the sample size depending on the chosen sampling method and the characteristics of the target population.
5.3.1 Sample Size Calculation Formula
The most common formula is:

n = \frac{Z^2 \cdot p \cdot (1 - p)}{E^2}

where:
— n is the sample size.
— Z is the Z-score corresponding to the desired confidence level.
— p is the estimated proportion of the population.
— E is the desired margin of error.
5.3.2 Practical Example
Suppose you want to determine how many customers out of 300,000 will purchase a new product in a supermarket. By choosing a 95% confidence level and a 5% margin of error, you can use the formula to calculate the required sample size. If Z = 1.96, p = 0.5, and E = 0.05, the calculation would be:

n = \frac{1.96^2 \cdot 0.5 \cdot (1 - 0.5)}{0.05^2} = 384.16

This means that a sample size of 385 customers (rounded up) is required to obtain a reliable estimate with the chosen parameters.
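The calculation above can be sketched as a small Python function, a direct transcription of the formula rather than a statistics library call.

```python
import math

def sample_size(z: float, p: float, e: float) -> int:
    """n = z^2 * p * (1 - p) / e^2, rounded up to a whole person."""
    return math.ceil(z**2 * p * (1 - p) / e**2)

# 95% confidence (z = 1.96), p = 0.5, 5% margin of error:
print(sample_size(1.96, 0.5, 0.05))  # 385
```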
Remark 5. Calculating the sample size is crucial to ensure the representativeness and accuracy of the results. It allows for trade-offs between cost, precision, and available resources while ensuring that the results obtained are sufficiently reliable to make informed decisions.
5.4 Data Cleaning with Spreadsheets
Data cleaning is a critical step to ensure your data is accurate, complete, and usable for analysis. This process requires not only technical skills but also careful attention to the quality and consistency of the information. Below are the basic steps for data cleaning, enriched with expert advice and best practices.
1. Importing Data from an External Source: Begin by importing the data from the external source into your spreadsheet. Make sure you understand the data format and verify that the import was successful. Expert Tip: Check the regional settings of the spreadsheet (such as date or number format) before importing to avoid errors due to format incompatibilities.
2. Creating a Backup Copy: Before making any modifications, create a backup copy of the original data in a separate workbook. This allows you to revert if an issue arises. Expert Advice: Always keep a copy of the original data in its raw form, even if significant cleaning is performed.
3. Checking Data Structure: Ensure that the data is presented in a tabular format, with similar data in each column. The data should be consistent, with no unnecessary empty rows or columns.
4. Cleaning Text Data:
— Spelling Correction: Use built-in correction tools in Excel or Google Sheets. You can also use the FIND/REPLACE function to correct typographical errors.
— Removing Duplicate Text: Use the REMOVE DUPLICATES function in Excel or Google Sheets to eliminate repeated values in one or more columns.
5. Manipulating Columns: If you need to clean the columns, follow these steps:
(a) Insert a new column next to the original column.
(b) Use functions like LEFT, RIGHT, or MID to extract or modify parts of the text.
(c) Apply the formula to the entire column.
(d) Copy the results and paste them as values to preserve the transformations.
(e) Delete the original column and rename the new column.
6. Handling Duplicates: Identify and remove duplicates. Before removing duplicates, filter the unique values first to confirm the results. In Excel or Google Sheets, use REMOVE DUPLICATES under the Data tab to automatically remove duplicates from one or more columns.
7. Using Predefined Functions: Predefined functions like LEFT, RIGHT, or CONCATENATE can be useful for manipulating text strings. Additionally, the VLOOKUP or INDEX functions are useful for performing matches or cross-referencing between multiple data tables.
8. Correcting Input Errors: Use the built-in spell checker in Excel and Google Sheets, or apply conditional formulas to spot inconsistent values, such as incorrectly formatted dates or invalid numeric values.
9. Merging and Splitting Columns:
— Merging Columns: Use the CONCATENATE function or the & operator to merge two or more columns into one.
— Splitting Columns: Use the TEXT TO COLUMNS function to split data from one column into multiple columns (e.g., separating first and last names).
10. Transforming Rows into Columns and Vice Versa: Use the TRANSPOSE tool in Excel or Google Sheets to rearrange rows into columns (or vice versa) for pivot analysis.
11. Removing Unwanted Characters: Use functions like TRIM, CLEAN, and SUBSTITUTE to remove spaces and invisible characters that may interfere with analysis.
12. Handling Numbers and Currencies: Check that numbers and currency values are correctly formatted. Use the cell formatting options in Excel or Google Sheets to apply a specific numerical format (currency, percentage, etc.).
13. Converting and Reformatting Dates and Times: Standardize date and time formats to ensure consistency in reports and analysis. Use the YYYY-MM-DD date format to avoid errors due to regional date settings.
14. Documenting Errors and Modifications: Note all errors and correction steps in a log or dedicated sheet. This will help you track changes and avoid repeating errors.
15. Handling Missing Data: Identify missing data by using the COUNTIF function to count missing values in a column. Then, choose between imputation (mean, median, etc.) or deletion of the affected rows.
16. Backing Up Data: Before starting the cleaning process, make a complete backup of the data to prevent accidental loss. Use versioning features such as Google Sheets' version history to keep track of different versions of the data.
17. Using Conditional Formatting: Use conditional formatting to quickly spot anomalies, such as missing or outlier values. For example, you can highlight in red the cells containing values greater than a certain threshold, or empty cells.
5.5 Examples of Data Cleaning in Excel and
Google Sheets
Here are some specific steps for cleaning data in Excel and
Google Sheets.
5.5.1 Identification and Handling of Missing Data
In Excel or Google Sheets, you can use the following formulas to detect and handle missing data:
— Checking Empty Cells: Use IF(ISBLANK(A1), "Blank", "Not Blank") to test if a cell is empty.
— Imputation by Mean: Use the formula IF(ISBLANK(A1), AVERAGE(A1:A100), A1) to replace empty cells with the mean of the other values in the column.
5.5.2 Removing Duplicates
To remove duplicates in Excel or Google Sheets:
— Select the column or data range.
— Go to the Data tab and click on Remove Duplicates.
5.5.3 Removing Unwanted Characters
To clean spaces or invisible characters in a cell, you can use:
— =TRIM(A1) to remove extra spaces.
— =SUBSTITUTE(A1, CHAR(160), "") to remove non-breaking spaces.
5.6 Data Cleaning: Key Steps and Tools
Data cleaning is a critical step in any data analysis project.
It includes several sub-steps such as identifying and handling
missing values, removing duplicates, correcting inconsistencies,
and transforming data. This process can be carried out using
various tools, including SQL, R, and Python.
5.6.1 Identification and Handling of Missing Data
Missing data can appear in various forms, such as absent values or unfilled fields. The handling of missing data depends on several factors, such as the nature of the project and the amount of missing data. Options include:
— Data Imputation: Replace missing values with the mean, median, or other statistical methods.
— Data Deletion: If the percentage of missing data is low, the affected rows may be deleted.
R Example for Replacing Missing Values with the Mean:

data$column[is.na(data$column)] <- mean(data$column, na.rm = TRUE)
SQL Example for Replacing NULL Values with a Default Value:

-- Replace NULL values with a default value
UPDATE ma_table
SET colonne = 'Default Value'
WHERE colonne IS NULL;
Python Example for Imputing Missing Data:

import pandas as pd

# Load the data
df = pd.read_csv('data.csv')

# Impute missing values with the mean
df['column'] = df['column'].fillna(df['column'].mean())
5.6.2 Removing Duplicates
Duplicates can appear when merging multiple sources or due
to issues in data collection. To ensure the quality of analyses, it
is essential to remove them.
SQL Example for Removing Duplicates:

-- Remove duplicates in the table
WITH UniqueData AS (
    SELECT DISTINCT *
    FROM ma_table
)
SELECT * FROM UniqueData;
Python Example for Removing Duplicates:

import pandas as pd

df = pd.read_csv('data.csv')

# Remove duplicates
df = df.drop_duplicates()
R Example for Removing Duplicates:

library(dplyr)

df <- data.frame(
  id = c(1, 2, 2, 3),
  value = c("A", "B", "B", "C")
)

# Remove duplicates
df_cleaned <- df %>% distinct()
5.6.3 Correction of Inconsistencies and Typographical
Errors
Inconsistencies such as incorrect entries or misspelled names
need to be corrected. For example, an analyst might standardize
the names of stations or harmonize the units of measurement.
Python Code Example to Standardize Values:

df['station_name'] = df['station_name'].str.lower().str.strip()
5.6.4 Data Transformation
Once the data is cleaned, it must be transformed to match
the required format for analysis. This includes converting dates
into a uniform format, normalizing units of measurement, or
creating new derived variables.
Python Example to Transform a Date into a Uniform Format:

df['started_at'] = pd.to_datetime(df['started_at'])
R Example to Convert a Date Column into Standard Format:

df$started_at <- as.POSIXct(df$started_at, format = "%Y-%m-%d %H:%M:%S")
5.6.5 Removing or Imputing Missing Data
A strategic choice must be made between removing or imputing missing data. Removal can be chosen when the proportion of missing data is low; in other cases, imputation is preferable to avoid losing too much data.
If imputing missing data, several methods can be used:
— Imputation by Mean, Median, or Mode: Replace missing values with values calculated from the existing data.
— Imputation by Interpolation: Use interpolation methods to estimate missing values.
It is essential to test both approaches (removal and imputation) to choose the one that best suits the project and the available data.
5.7 Bike Rental Project Example
Referring to the bike rental project, here are some steps that might be necessary.
R Example to Impute Missing Values with the Mean:

df_cleaned <- df %>%
  mutate(start_lat = ifelse(is.na(start_lat), mean(start_lat, na.rm = TRUE), start_lat))
Python Example to Replace Missing Data with the Mean:

df['start_lat'] = df['start_lat'].fillna(df['start_lat'].mean())
5.7.1 Data Cleaning with R
Here is an example of data cleaning using the R language,
where missing values are replaced with the mean for numeric
columns and with a default value for text columns.
# Load the dplyr library
library(dplyr)

# Example dataframe with missing values (NA or empty strings)
df <- data.frame(
  ride_id = c("600CFD130D0FD2A4", NA, ""),
  rideable_type = c("electric_bike", "classic_bike", ""),
  started_at = c("2022-06-30 [Link]", "2022-07-01 [Link]", NA),
  ended_at = c("2022-06-30 [Link]", NA, "2022-07-01 [Link]"),
  start_lat = c(41.89, NA, 41.90),
  start_lng = c(-87.62, -87.65, NA),
  end_lat = c(41.91, NA, 41.92),
  end_lng = c(-87.62, NA, -87.63),
  member_casual = c("casual", "member", "casual")
)

# Data cleaning: replace NA (or empty strings) with default values,
# and impute missing coordinates with the column mean
df_cleaned <- df %>%
  mutate(
    ride_id = ifelse(is.na(ride_id) | ride_id == "", "Unknown", ride_id),
    rideable_type = ifelse(is.na(rideable_type) | rideable_type == "", "Unknown", rideable_type),
    started_at = ifelse(is.na(started_at) | started_at == "", "Unknown", started_at),
    ended_at = ifelse(is.na(ended_at) | ended_at == "", "Unknown", ended_at),
    start_lat = ifelse(is.na(start_lat), mean(start_lat, na.rm = TRUE), start_lat),
    start_lng = ifelse(is.na(start_lng), mean(start_lng, na.rm = TRUE), start_lng),
    end_lat = ifelse(is.na(end_lat), mean(end_lat, na.rm = TRUE), end_lat),
    end_lng = ifelse(is.na(end_lng), mean(end_lng, na.rm = TRUE), end_lng),
    member_casual = ifelse(is.na(member_casual) | member_casual == "", "Unknown", member_casual)
  )

# Display the cleaned data
print(df_cleaned)
5.7.2 Removing Missing Values in R
Here is an example of R code that removes rows containing missing data from the dataframe df:

# Load the dplyr library
library(dplyr)

# Example dataframe with missing values
df <- data.frame(
  ride_id = c("600CFD130D0FD2A4", NA, ""),
  rideable_type = c("electric_bike", "classic_bike", ""),
  started_at = c("2022-06-30 [Link]", "2022-07-01 [Link]", NA),
  ended_at = c("2022-06-30 [Link]", NA, "2022-07-01 [Link]"),
  start_lat = c(41.89, NA, 41.90),
  start_lng = c(-87.62, -87.65, NA),
  end_lat = c(41.91, NA, 41.92),
  end_lng = c(-87.62, NA, -87.63),
  member_casual = c("casual", "member", "casual")
)

# Remove rows with missing values
# (note: na.omit drops NA values, not empty strings)
df_cleaned <- df %>% na.omit()

# Display the cleaned data
print(df_cleaned)
Chapter 6
Principles and Practices of Data Aggregation and Statistical Analysis
6.1 Importance and Techniques of Data Aggregation
Data aggregation plays a key role in the analysis process, allowing raw data to be transformed into actionable information. It is most effective after data cleaning, ensuring that consolidated information is ready for in-depth analysis. The role of the data analyst in this step is crucial, as they ensure the structuring, transformation, and preparation of data, thus facilitating the extraction of relevant insights.
Aggregation is an essential step for generating descriptive statistics and quantitative summaries. It is crucial for simplifying large volumes of data while preserving their meaning and relevance for strategic decision-making. Two key aspects are addressed here: aggregation methods and the tools used to carry out this task.
6.2 Correlation and Causality
It is crucial to understand the difference between correlation and causality: although these concepts may seem similar, their implications are very different when it comes to drawing conclusions from data [Pearl, 2009, ?].
Correlation: Correlation measures the strength and direction of the relationship between two variables. For example, a positive correlation indicates that when one variable increases, the other tends to increase as well [?]. Conversely, a negative correlation suggests that when one increases, the other decreases. However, it is important to emphasize that correlation does not necessarily imply a cause-and-effect relationship. A classic example of correlation without causality is the relationship between the number of ice creams sold and the number of drownings: both increase in summer, but this does not mean that buying ice cream causes drownings. This relationship is actually driven by a third factor, namely the high temperature [?].
Causality: Causality goes beyond the mere association between two variables and suggests that one event causes another. To establish a causal relationship, not only must the two events be correlated, but an in-depth analysis of the underlying mechanisms and context must demonstrate that one truly causes the other [?]. For example, smoking is a proven cause of lung cancer: there is a well-documented relationship between tobacco exposure and an increased risk of developing the disease, supported by longitudinal studies and experimental evidence [?]. However, to establish a causal relationship, it is crucial to have solid evidence and to consider all potential variables.
When analyzing data, it is essential to keep the following in mind to avoid erroneous conclusions based solely on correlations [?]:
— Critically examine identified correlations, seeking additional evidence that may confirm or refute any causal relationship [?]. This can include advanced statistical analyses such as interaction tests or experimental models.
— Consider the context of the data to assess whether a causal relationship makes sense. For example, a correlation between sleep hours and academic performance might be due to a third factor, such as a healthier lifestyle, rather than a direct cause-and-effect relationship [?].
— Understand the limitations of the analysis tools you are using. For instance, statistical tools may sometimes detect correlations that are not causally significant, especially if the data are influenced by biases or hidden variables [?].
Example of correlation and causality:
Take the example of the correlation between the number of
police officers in a city and crime rates. It might seem that an
increase in the number of police officers leads to a decrease in
crime. However, it is important to ask whether the increased
presence of police officers is truly the cause of the decrease in
crime, or whether other factors, such as improved social policies
or a better educational system, have contributed to the decrease.
In this case, in-depth analysis is necessary to determine whether
the observed correlation is caused by the actions of the police or
by other parallel interventions [?].
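The ice-cream example above can be illustrated with a small simulation: a third variable (temperature) drives two otherwise unrelated series, which then appear strongly correlated. The numbers below are synthetic and purely illustrative.

```python
import random

random.seed(0)

# Temperature drives both series; neither causes the other.
temps = [random.uniform(10, 35) for _ in range(200)]
ice_cream = [t * 3 + random.gauss(0, 5) for t in temps]   # sales
drownings = [t * 0.2 + random.gauss(0, 1) for t in temps]  # incidents

def corr(x, y):
    """Pearson correlation coefficient, computed from scratch."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# Strong correlation despite no causal link between the two series.
print(round(corr(ice_cream, drownings), 2))
```

Conditioning on temperature (the confounder) would make the apparent relationship largely disappear.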
6.3 Key Aggregation Methods
Data aggregation includes various processes to transform raw data into meaningful statistics and summaries suited to the analysis objectives. Data analysts employ different techniques to obtain specific summaries, depending on the study's needs:
— Calculation of Descriptive Statistics: By calculating measures such as the mean, median, standard deviation, variance, minimum, and maximum, analysts provide a concise and understandable overview of numerical data distributions.
— Temporal Aggregation: Data are often grouped by periods (days, months, quarters, etc.), making it easier to identify trends and seasonal variations. This type of aggregation is essential in longitudinal analyses and time series studies.
— Calculation of Rates and Percentages: Using rates and percentages helps evaluate proportions or ratios in data, offering a useful comparative perspective for analysts in various contexts, such as monitoring KPIs (key performance indicators).
— Categorization and Grouping of Data: By grouping data into relevant categories, it becomes possible to analyze groups of similar items and examine the differences or similarities between these groups.
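The first two methods can be sketched with pandas: grouping by period and computing descriptive statistics per group. The data below are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb"],
    "rides": [120, 80, 150, 90],
})

# Temporal aggregation plus descriptive statistics per group.
summary = df.groupby("month")["rides"].agg(["sum", "mean"])
print(summary.loc["Jan", "sum"])  # 200
```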
6.4 Tools and Software for Data Aggregation
Data analysts use a range of tools to carry out data aggregation, including specialized software like Python (with libraries such as Pandas), R, SQL, and BI (Business Intelligence) platforms like Tableau and Power BI. These tools allow for efficient aggregation of large data sets and visualization to draw operational conclusions. Here is an overview of the main tools and software, along with concrete application examples across various sectors.
6.4.1 Main Tools and Software Used
Data analysts use a range of tools to aggregate and analyze data, including :
Spreadsheets : Tools like Excel and Google Sheets provide basic functionality for data aggregation and cleaning, such as calculating averages, sums, or custom formulas to segment data.
Programming Languages : Languages like Python (via the Pandas library), R, SQL, and Julia allow for advanced aggregation, automated cleaning, and manipulation of large data sets.
Database Management Systems : Relational databases (MySQL, PostgreSQL, SQL Server) facilitate the storage of large data sets while enabling complex aggregation operations, notably via SQL.
Business Intelligence (BI) Tools : Platforms like Tableau, Power BI, and QlikView enable interactive visualizations, making it easier to interpret aggregated data and communicate insights.
6.4.2 Examples of Data Aggregation by Sector
The applications of data aggregation vary across sectors. Here are some concrete examples :
Retail :
— Aggregating sales data to obtain key indicators such as total revenue, average sales per product, or segmentation by region.
— Using spreadsheets to calculate sales averages and identify the most popular products or the regions with high potential.
Marketing :
— Aggregating traffic data to analyze user behavior, using Python or SQL to automate calculations on large data sets.
— Identifying the most visited pages and traffic sources, allowing marketing strategies to be adjusted based on user behavior.
Financial Services :
— Segmenting clients according to demographic and financial criteria, using databases to store and query client data.
— Creating risk profiles, allowing financial products to be tailored to each customer segment.
Healthcare :
— Aggregating patient data to analyze health trends and assess the effectiveness of treatments.
— Using BI tools to visualize medical data and identify patterns in clinical results.
In all of these examples, aggregation allows analysts to synthesize information and provide strategic insights to decision-makers. The tools used, whether spreadsheets, programming languages, or BI platforms, are essential for obtaining reliable insights and facilitating decision-making.
Common Aggregation Functions in Spreadsheets
In the field of data analysis, several common aggregation functions are used to summarize data and extract relevant statistics. Here are some examples using the VLOOKUP, VALUE, and TRIM functions.
VLOOKUP : This function looks up a specific value in a given column and returns an associated piece of information. Here is an example of using VLOOKUP to find the name of a product based on its code :
Listing 6.1 – VLOOKUP Example in Excel
=VLOOKUP(A2, Products!A:B, 2, FALSE)
VALUE : Converts a text string into a number for easier arithmetic calculations.
Listing 6.2 – VALUE Example in Excel
=VALUE("123")
TRIM : Removes extra spaces around a string.
Listing 6.3 – TRIM Example in Excel
=TRIM(A2)
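For analysts working in Python rather than a spreadsheet, each of these functions has a close pandas counterpart : merge() plays the role of VLOOKUP, to_numeric() that of VALUE, and str.strip() that of TRIM. A small sketch with invented tables :

```python
import pandas as pd

orders = pd.DataFrame({'Code': ['P1', 'P2'], 'Qty': ['  3 ', ' 5 ']})
products = pd.DataFrame({'Code': ['P1', 'P2'], 'Name': ['Bike', 'Helmet']})

# VLOOKUP equivalent: look up each product name by its code
orders = orders.merge(products, on='Code', how='left')

# TRIM equivalent: remove surrounding spaces from the text
orders['Qty'] = orders['Qty'].str.strip()

# VALUE equivalent: convert the cleaned text into numbers
orders['Qty'] = pd.to_numeric(orders['Qty'])
```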
6.4.3 Aggregation in SQL
SQL allows powerful data aggregation, particularly for relational databases. Common aggregation functions include :
Common Aggregation Functions in SQL
SUM : Calculates the sum of a column of numeric values,
commonly used for calculations of amounts like sales or
expenses.
AVG : Returns the average of the values in a numeric column,
useful for trend analysis.
COUNT : Counts the number of rows, often used to quantify
occurrences.
MAX and MIN : Used to find the maximum and minimum
values in a column, for example, to identify record sales
or lowest prices.
Here is an example of an SQL query to calculate the total sales
by product for a given period :
Listing 6.4 – SQL Example - Sales by Product
SELECT Product, SUM(Amount) AS TotalSales
FROM Sales
WHERE Date BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY Product;
This query groups the sales by product and calculates the
total sales amount for each product.
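This query can be run end to end with Python’s built-in sqlite3 module, which is a convenient way to test SQL aggregations without a database server. The rows below are invented for illustration :

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE Sales (Product TEXT, Amount REAL, Date TEXT)")
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?)", [
    ('A', 100, '2023-01-01'),
    ('B', 200, '2023-06-15'),
    ('A', 150, '2023-08-20'),
    ('C', 300, '2023-12-05'),
])

# Total sales per product over the year, as in Listing 6.4
rows = conn.execute("""
    SELECT Product, SUM(Amount) AS TotalSales
    FROM Sales
    WHERE Date BETWEEN '2023-01-01' AND '2023-12-31'
    GROUP BY Product
""").fetchall()
conn.close()
```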
6.4.4 Aggregation in Python
Python is commonly used for data analysis thanks to libraries like Pandas. Before showing an example, here are some popular libraries and aggregation functions used to manipulate and analyze data in Python.
Common Libraries and Aggregation Functions in Python
Pandas : A powerful library for data manipulation, offering
functions like groupby(), agg(), and pivot_table() for
aggregating data.
NumPy : Complementary to Pandas, this library provides
functions for fast numerical calculations, such as sum(),
mean(), and median().
groupby() : In Pandas, this function allows data to be grouped by one or more columns and performs aggregation calculations like sum or mean.
pivot_table() : This function is used to create pivot
tables, providing a summary view of data grouped by
multiple criteria.
Here is an example using Pandas to aggregate sales data by
product :
Listing 6.5 – Python Example - Aggregating Sales by Product
import pandas as pd

# Loading the data
df = pd.DataFrame({
    'Product': ['A', 'B', 'A', 'C'],
    'Amount': [100, 200, 150, 300],
    'Date': ['2023-01-01', '2023-06-15', '2023-08-20', '2023-12-05']
})

# Aggregating by product
result = df.groupby('Product')['Amount'].sum().reset_index()
print(result)
In this example, the data is grouped by product, and the
sum of amounts is calculated for each group.
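The pivot_table() function mentioned above extends this idea to two grouping criteria at once. A short sketch, again with invented figures :

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['A', 'B', 'A', 'B'],
    'Region': ['North', 'North', 'South', 'South'],
    'Amount': [100, 200, 150, 300],
})

# One row per product, one column per region, cells holding summed amounts
pivot = pd.pivot_table(df, values='Amount', index='Product',
                       columns='Region', aggfunc='sum')
```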
6.4.5 Aggregation in R
R is a language specialized for statistical analysis, with powerful aggregation functions. Before showing an example, here are some commonly used libraries and functions for data aggregation in R.
Common Functions and Libraries in R
dplyr : A very popular R package for data manipulation, offering powerful functions like group_by(), summarize(), and mutate() to perform aggregation and transformation tasks.
tidyr : Another complementary library that helps transform
data into wide or long formats, often used in conjunction
with dplyr to prepare data before or after aggregation.
group_by() : This function from dplyr is used to group
data by one or more variables.
summarize() : Used with group_by() to apply aggregation
functions such as sum, mean, or other statistics to groups
of data.
Base R : Functions like aggregate() and tapply() in Base
R also allow data aggregation, although dplyr is often
preferred for its concise syntax and flexibility.
Here is an example using dplyr to aggregate sales by product :
Listing 6.6 – R Example - Aggregating Sales by Product
library(dplyr)

# Creating the data
data <- data.frame(
  Product = c("A", "B", "A", "C"),
  Amount = c(100, 200, 150, 300),
  Date = as.Date(c("2023-01-01", "2023-06-15", "2023-08-20", "2023-12-05"))
)

# Aggregating by product
result <- data %>%
  group_by(Product) %>%
  summarize(TotalSales = sum(Amount))
print(result)
This example creates a fictional sales dataset, groups it by
product, and calculates the sum of amounts.
6.5 Data Interpretation
Data interpretation is a fundamental process in data analysis, where analysts seek to make sense of raw numbers by contextualizing them, identifying trends, and formulating predictions or recommendations. This step goes far beyond the application of mathematical formulas ; it also relies on critical and philosophical thinking that guides reasoning towards a deep understanding of the results obtained. Data interpretation is an exercise that requires scientific rigor and objectivity, but also the ability to question and challenge underlying assumptions.
The philosophy of science teaches us that data interpretation is never a final or definitive process. Karl Popper, in The Logic of Scientific Discovery (1972) [Popper, 1972], emphasizes that science is primarily a dynamic process of "conjectures and refutations." Therefore, data analysis should be seen as an attempt to understand observed phenomena, a hypothesis that must be continuously tested and confronted with new information. Furthermore, Charles Sanders Peirce, a pragmatist philosopher, introduced the concept of abduction, a reasoning process that involves formulating the most plausible explanation for a set of observed facts. This method is particularly relevant for data analysis as it allows proposing interpretation models that must then be validated by experience or further testing.
However, data interpretation is not only a matter of methodology and logic. It also requires a contextual understanding of the data. Nate Silver, in his book The Signal and the Noise (2012) [Silver, 2012], stresses the importance of linking data to real-world situations. For him, statistics only make sense when interpreted in context. An analyst must understand where the data comes from, the framework in which it was collected, and the external factors that may influence the results. The absence of this context can lead to misinterpretations, as evidenced by many examples in business and public policy.
There are several common mistakes in data interpretation that can compromise the quality of the analysis. One of the most frequent is confusing correlation with causality. It is easy to wrongly conclude that one variable causes another simply because they are correlated. For example, studies have shown a correlation between ice cream consumption and drowning incidents. However, concluding that ice cream consumption causes drownings would be a mistake, because the relationship is likely due to a third factor : summer heat, which increases both swimming and ice cream purchases. Similarly, it is important to avoid neglecting uncertainties in the results. The analyst must account for data variability, particularly through the use of confidence intervals and statistical tests to assess the robustness of their conclusions.
Objectivity is also essential in data interpretation. The analyst must be wary of cognitive biases, such as confirmation bias, which might push them to interpret results in a way that confirms their expectations or hypotheses. In a broader context, as the statistician George Box points out, "all models are wrong, but some are useful" (1976) [Box, 1976]. This highlights that even if a data interpretation may be incorrect or incomplete, it can still provide useful information for decision-making, as long as it is used consciously and with discernment.
Finally, the ethics of data interpretation should not be underestimated. Analysts have a responsibility to present their results transparently, clearly indicating the limitations of the data and models used. The misuse or manipulation of data, for example to justify specific policies or actions, can have serious consequences. The ethics of data analysis also involves respect for data confidentiality and a commitment to avoiding the exploitation of biases in the data for self-serving purposes.
Thus, data interpretation is a complex process that combines mathematical rigor, critical thinking, and ethical responsibility. It relies on a delicate balance between understanding the obtained results, accepting their limitations, and the ability to apply them wisely in specific contexts. The analyst must always remember that data is not an end in itself, but a means of better understanding the world around us and making informed decisions.
6.6 Mathematical Formulas in Data Analysis
Mathematical formulas play a central role in data analysis, facilitating the calculation of various metrics that allow summarizing, comparing, and interpreting data. These formulas are essential for extracting meaningful information from complex data sets. Below is an overview of some commonly used formulas, along with their purpose, relevance, and examples of application.
Mean Formula : The arithmetic mean is a key indicator that represents the central value of a data set. It is calculated using the following formula :

\[ \text{Mean} = \frac{\sum_{i=1}^{n} x_i}{n} \]

Where $x_i$ represents each value in the sample, and $n$ is the total number of values in the set.
Purpose : To obtain an overall estimate of the central
tendency of the data.
Relevance : The mean is particularly useful for symmetric data sets, but it can be influenced by extreme values, making it less robust in some cases (see median).
Median Formula : The median divides the data set into
two equal parts, identifying a central value in the data,
regardless of the distribution of values. If n is odd, the
median is the value at the center of the sorted sample. If
n is even, it is calculated as the average of the two central
values.
Purpose : To identify the value that separates the
data into two equal-sized groups.
Relevance : The median is particularly useful for
skewed data or data with outliers, as it is not influenced
by extremes.
Standard Deviation Formula : The standard deviation measures the dispersion of values around the mean. It is calculated by the formula :

\[ \text{Standard Deviation} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \text{Mean})^2}{n}} \]
A high standard deviation indicates a wide dispersion,
while a low standard deviation suggests that the values
are close to the mean.
Purpose : To measure the variability of data relative
to the mean.
Relevance : The standard deviation helps evaluate
data stability and is crucial for determining the degree of
confidence in predictions and results.
Correlation Formula : Pearson’s correlation measures the strength and direction of the linear relationship between two data sets. The formula is as follows :

\[ \text{Correlation} = \frac{\sum_{i=1}^{n} (x_i - \text{Mean}(X))(y_i - \text{Mean}(Y))}{(n-1) \cdot \text{Standard Deviation}(X) \cdot \text{Standard Deviation}(Y)} \]
Purpose : To evaluate the direction and strength of
the relationship between two variables.
Relevance : A high correlation indicates a strong relationship between variables, which is crucial for predictions and multivariate analysis.
Probability Formula : The probability of an event A in a sample space is given by :

\[ P(A) = \frac{\text{Number of favorable outcomes for } A}{\text{Total number of possible outcomes}} \]
Purpose : To estimate the likelihood of an event occurring.
Relevance : Probability is fundamental in decision-making in uncertain environments and forms the basis of many modeling and simulation approaches.
Normal Distribution Formula : The normal distribution, or Gaussian distribution, is used to model phenomena whose values are symmetrically distributed around the mean. Its probability density function is given by the formula :

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right) \]

Where $\mu$ is the mean and $\sigma$ is the standard deviation of the distribution.
Purpose : To model data symmetrically distributed
around a mean.
Relevance : The normal distribution is used in hypothesis testing and statistical modeling, especially for making estimations on large data sets.
Minimum and Maximum Formula : The formulas for minimum and maximum identify the smallest and largest values in a data set, which is useful for spotting extremes and anomalies.
Purpose : To identify the extreme values in a data set.
Relevance : These formulas are essential for detecting anomalies and conducting extreme-value analyses or outlier detection.
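The normal density formula above, and the cumulative probabilities derived from it, can be evaluated with nothing more than the standard library’s math module. The parameters below (µ = 100, σ = 15) are purely illustrative :

```python
import math

mu, sigma = 100.0, 15.0  # illustrative parameters

def normal_pdf(x):
    """Density f(x), as in the formula above."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x):
    """P(X <= x), computed with the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# About 68% of the values of a normal variable lie within one sigma of the mean
p_one_sigma = normal_cdf(mu + sigma) - normal_cdf(mu - sigma)
print(round(p_one_sigma, 4))  # 0.6827
```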
6.6.1 Practical Example
Let’s consider a concrete example : a group of five students
received the following scores on an exam : 85, 92, 78, 95, 88. Let’s
apply the formulas above to better understand this distribution.
— Average : Average = (85 + 92 + 78 + 95 + 88) / 5 = 87.6
— Median : The sorted data is 78, 85, 88, 92, 95. Therefore,
the median is 88.
— Standard Deviation : By applying the standard deviation formula (with division by n), we get Standard Deviation ≈ 5.89.
— Correlation : Imagine we measure the correlation between
study hours and grades. Suppose the correlation is r =
0.85, indicating a strong and positive relationship.
— Probability : If we want to know the probability of getting
a grade above 90, we could use the normal distribution
function to calculate this probability, assuming the grades
follow a normal distribution.
— Minimum : The minimum grade is 78.
— Maximum : The maximum grade is 95.
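The worked example above can be checked in a few lines with Python’s statistics module. Note that pstdev() is the population standard deviation (dividing by n, as in the formula given earlier), while stdev() would divide by n − 1 :

```python
from statistics import mean, median, pstdev

scores = [85, 92, 78, 95, 88]

print(mean(scores))               # 87.6
print(median(scores))             # 88
print(round(pstdev(scores), 2))   # population standard deviation, ~5.89
print(min(scores), max(scores))   # 78 95
```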
6.6.2 Mathematics and Tools
Mathematical and statistical tools are essential for performing these calculations in data analysis software. Here are some commonly used functions in this area :
Addition (SUM) : Used to sum a series of numerical values. It forms the basis for aggregations and total calculations.
Division (DIV) : Used to calculate ratios or rates, especially in the analysis of relative results.
Multiplication (MULT) : Essential for calculating products or forecasted values.
Logarithm (LOG) : A function used to transform exponential data into linear values.
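The remark about the logarithm can be made concrete : exponential growth becomes a straight line after a log transform, because equal growth ratios become equal differences. A toy example with invented values :

```python
import math

# Invented series that doubles at every step (exponential growth)
values = [100 * 2 ** t for t in range(5)]   # 100, 200, 400, 800, 1600

# After the log transform, consecutive differences are constant: ln(2)
logs = [math.log(v) for v in values]
steps = [round(b - a, 6) for a, b in zip(logs, logs[1:])]
print(steps)  # four identical values of about 0.693147
```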
These tools simplify the execution of complex calculations and analyses, enabling analysts to better understand and interpret data.
6.7 Example of the Bike Rental Project
6.7.1 Context
In the context of the bike rental project, the collected data
includes information about the bike trips, such as ride IDs, bike
types, start and end times, start and end stations, geographical
coordinates, and subscriber types (member or casual). After Lee,
the data engineer, has collected and cleaned this data, Kareem,
the data analyst, takes over to perform a deeper analysis.
Here is an excerpt from the cleaned data :
[
{
"ride_id": "600CFD130D0FD2A4",
"rideable_type": "electric_bike",
"started_at": "2022-06-30 [Link]",
"ended_at": "2022-06-30 [Link]",
"start_station_name": "Station A",
"start_station_id": 1,
"end_station_name": "Station B",
"end_station_id": 2,
"start_lat": 41.89,
"start_lng": -87.62,
"end_lat": 41.91,
"end_lng": -87.62,
"member_casual": "casual"
}
]
6.7.2 Data Analysis by Kareem
After the data cleaning by Lee, Kareem begins the data analysis to draw relevant conclusions. Here are some of the main steps he follows :
1. Usage Trend Analysis
Kareem explores the data to identify trends in bike usage,
such as peak times or differences in frequency between subscriber
types. For example, he could calculate the average trip duration.
Example of calculating average trip duration (Python) :
# Calculate the duration of each trip in minutes
df['started_at'] = pd.to_datetime(df['started_at'])
df['ended_at'] = pd.to_datetime(df['ended_at'])
df['duration'] = (df['ended_at'] - df['started_at']).dt.total_seconds() / 60

# Calculate the average duration
mean_duration = df['duration'].mean()
Example in R :
# Convert dates to datetime format
df$started_at <- as.POSIXct(df$started_at)
df$ended_at <- as.POSIXct(df$ended_at)

# Calculate the duration in minutes
df$duration <- as.numeric(difftime(df$ended_at, df$started_at, units = "mins"))

# Calculate the average duration
mean_duration <- mean(df$duration, na.rm = TRUE)
Example in a Spreadsheet :
In a spreadsheet, you can use the following function to calculate the average duration by subscription type :
=AVERAGEIF(A2:A100, "member", B2:B100)
where A2:A100 represents the column for subscription types and
B2:B100 represents the column for trip durations.
2. User Segmentation
Kareem also analyzes user behavior based on their subscription type (member_casual). He might want to know if members use bikes more often than casual users.
Example of analysis of casual vs member users (Python) :
# Calculate average duration by subscription type
mean_duration_by_type = df.groupby('member_casual')['duration'].mean()
Example in R :
# Calculate average duration by subscription type
mean_duration_by_type <- aggregate(duration ~ member_casual, data = df, FUN = mean)
Example in a Spreadsheet :
In a spreadsheet, you can use the COUNTIF function to calculate the number of trips by subscription type :
=COUNTIF(A2:A100, "member")
where A2:A100 represents the column for subscription types.
3. Identifying Popular Stations
Kareem analyzes geographical data to identify the most used
stations. This can help manage stations better and adjust bike
availability according to demand.
Example of visualizing popular stations (Python) :
import matplotlib.pyplot as plt

# Count the number of trips by start station
start_station_counts = df['start_station_name'].value_counts()

# Visualize the most popular stations
start_station_counts.plot(kind='bar', title='Popularity of Start Stations')
plt.xlabel('Start Stations')
plt.ylabel('Number of Trips')
plt.show()
Example in R :
# Count trips by start station
start_station_counts <- table(df$start_station_name)

# Visualize popular stations
barplot(start_station_counts, main = "Popularity of Start Stations",
        xlab = "Start Stations", ylab = "Number of Trips")
Example in a Spreadsheet :
In a spreadsheet, to calculate the number of trips by start
station, you can use the COUNTIF function :
=COUNTIF(A2:A100, "Station A")
where A2:A100 represents the column for start stations.
4. Forecasting and Modeling
Kareem can also build statistical or machine learning models
to predict future bike demand, based on hours or days of the
week.
Example of a regression model for bike demand based on the
hour (Python) :
from sklearn.linear_model import LinearRegression

# Prepare the data
df['hour'] = df['started_at'].dt.hour
X = df[['hour']]
y = df['duration']

# Create the regression model
model = LinearRegression()
model.fit(X, y)

# Predict the duration for a given hour
predicted_duration = model.predict([[14]])  # Prediction for 2 PM
Example in R :
# Create an hour variable
df$hour <- as.numeric(format(df$started_at, "%H"))

# Create the linear regression model (lm is built into base R)
model <- lm(duration ~ hour, data = df)

# Prediction for a given hour
predicted_duration <- predict(model, newdata = data.frame(hour = 14))
Example in a Spreadsheet :
In a spreadsheet, you can use the FORECAST function to predict the duration for a given hour (LINEST returns the regression coefficients themselves rather than a prediction) :
=FORECAST(14, B2:B100, A2:A100)
where B2:B100 is the range for durations, A2:A100 is the range for start hours, and 14 is the hour for which the duration is predicted.
6.8 Analysis of Bike Rental Trends
6.8.1 1. Weekend Rentals
Kareem explores the trips made during the weekend to evaluate whether there is increased bike demand on those days. This can help better manage bike availability during these periods.
Example in Python for extracting weekend trips :
# Extract weekend trips
df['day_of_week'] = df['started_at'].dt.dayofweek
weekend_rides = df[df['day_of_week'] >= 5]  # Saturday (5) and Sunday (6)

# Calculate average duration during the weekend
mean_duration_weekend = weekend_rides['duration'].mean()
Example in R :
# Extract weekend trips
df$day_of_week <- weekdays(df$started_at)
weekend_rides <- subset(df, day_of_week %in% c("Saturday", "Sunday"))

# Calculate average duration during the weekend
mean_duration_weekend <- mean(weekend_rides$duration, na.rm = TRUE)
6.8.2 2. Holiday Periods
Kareem can identify holiday periods to analyze bike demand
during those times. This can include specific public holidays such
as Christmas or summer vacations.
Example in Python for extracting trips during holidays :
# List of holidays to analyze (example with fictional dates)
holiday_dates = pd.to_datetime(['2022-12-25', '2022-07-04'])

# Extract trips during holidays
holiday_rides = df[df['started_at'].dt.date.isin(holiday_dates.date)]

# Calculate average duration during holidays
mean_duration_holiday = holiday_rides['duration'].mean()
Example in R :
# List of holiday dates
holiday_dates <- as.Date(c("2022-12-25", "2022-07-04"))

# Extract trips during holidays
holiday_rides <- subset(df, as.Date(df$started_at) %in% holiday_dates)

# Calculate average duration during holidays
mean_duration_holiday <- mean(holiday_rides$duration, na.rm = TRUE)
6.8.3 3. Peak Hours
Peak hours usually refer to times when demand is highest,
such as in the morning and evening. Kareem can analyze these
periods to adjust bike availability.
Example in Python for identifying peak hours :
# Extract start hour
df['hour'] = df['started_at'].dt.hour

# Identify peak hours (e.g., between 7 AM and 9 AM, and between 5 PM and 7 PM)
peak_hours_rides = df[((df['hour'] >= 7) & (df['hour'] <= 9)) |
                      ((df['hour'] >= 17) & (df['hour'] <= 19))]

# Calculate average duration during peak hours
mean_duration_peak_hours = peak_hours_rides['duration'].mean()
Example in R :
# Extract start hour
df$hour <- as.numeric(format(df$started_at, "%H"))

# Identify peak hours
peak_hours_rides <- subset(df, (hour >= 7 & hour <= 9) | (hour >= 17 & hour <= 19))

# Calculate average duration during peak hours
mean_duration_peak_hours <- mean(peak_hours_rides$duration, na.rm = TRUE)
6.8.4 4. Optimization of Bike Management
Once these analyses are conducted, Kareem can use this knowledge to optimize bike management. For example :
— Availability during peak hours : By identifying peak hours, Kareem can ensure that stations are well-stocked with bikes during these periods.
— Repositioning bikes during weekends or holidays : If demand increases during these times, it may be necessary to redistribute bikes between popular stations.
— Maintenance and cleaning during off-peak periods : Outside of peak hours and holiday periods, it may be easier to perform bike maintenance.
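All three of these levers rest on the same aggregated view : the number of departures per station and per hour. A sketch of how Kareem might build it with pandas (the trips below are invented for illustration) :

```python
import pandas as pd

trips = pd.DataFrame({
    'start_station_name': ['Station A', 'Station A', 'Station B', 'Station A'],
    'started_at': pd.to_datetime(['2022-06-30 08:10', '2022-06-30 08:45',
                                  '2022-06-30 18:05', '2022-07-02 11:30']),
})

trips['hour'] = trips['started_at'].dt.hour

# Departures per station and per hour: the basis for restocking decisions
demand = trips.groupby(['start_station_name', 'hour']).size().unstack(fill_value=0)
```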
6.8.5 Conclusion
The bike rental project highlights the importance of collaboration between data engineers and data analysts. While Lee handles data collection and cleaning, Kareem uses his analytical skills to extract useful insights and make recommendations that can optimize bike usage and improve the user experience.
Chapter 7
The Art of Data
Visualization and
Effective Communication
7.1 Data Visualization : An Essential Tool for
Communication
Data visualization involves graphically representing data to
make it easier to understand. It transforms raw numbers into
meaningful visual information. An effective visualization must
have a purpose, a story, data, and a visual form. As Edward
Tufte, one of the most respected experts in visualization, points
out, “The complexity of data is not the problem ; it’s the
means of making it accessible, understandable, and memorable”
[Tufte, 2001]. Visualization allows information to be structured
so that decision-makers can quickly grasp its nuances without
getting lost in a sea of numbers.
A well-crafted data visualization can make the difference between total confusion and a real understanding of a complex problem. It allows you to integrate a lot of information into a small space in an organized manner. Before creating a visualization,
it is essential to structure your thoughts, define your objectives,
analyze the data, and identify the patterns that emerge. This reminds us of the words of John Tukey, a pioneer in data analysis : “Graphs are not for proving anything ; they are for exploring the data and asking the right questions” [Tukey, 1977]. Thus, visualization should primarily be a tool for exploration.
A visual form without a purpose, story, or data could be a
sketch or even art. Data and a visual form without a purpose or
function are nothing more than smoke and mirrors. Data with a
purpose but no story or visual form are boring. All four elements
must be brought together to create an effective visualization.
Data visualization, sometimes called “data viz,” allows analysts to interpret data correctly. A good way to approach data visualization is to remember that it can make the difference between total confusion and a real understanding of a problem. Ben Shneiderman, one of the pioneers of the field, states, “Visualization is the key to making complex data accessible and useful to a wide audience” [Shneiderman, 1996].
Data visualization is a valuable tool for integrating a lot of
information into a confined space. To do this, you must first
structure and organize your ideas. Think about your objectives
and the conclusions you have drawn after sorting the data. Then,
examine the patterns you have noticed in the data, the elements
that surprised you, and, of course, how all of this fits into your
analysis. Identifying the key elements of your results helps set
the stage for knowing how to organize your presentation. Fundamentally, an effective data visualization should allow those who
view it to reach the same conclusion you did, but much more
quickly.
7.2 The Impact of Visualization on Communication and Decision Making
A famous anecdote in the field of data visualization is the example of Florence Nightingale’s polar area diagram. In 1858, she used this rose-shaped chart to illustrate the mortality rates in British military hospitals during the Crimean War. This visualization had such
an impact that it contributed to the improvement of sanitary
conditions in British army hospitals. This example demonstrates
how a good visualization can not only change the way people
perceive a problem but also lead to concrete actions. As Hans
Rosling says, “Graphs are not just communication tools, but
also tools of persuasion” [Rosling, 2018].
Decisions based on clear and well-structured visualizations
are often more effective than those based on raw data. Decision-
makers can more easily understand trends and patterns when
they are presented visually. This is especially true in areas like public health, where simple visualizations can communicate crucial information about disease spread or treatment effectiveness. For example, heat maps showing the density of COVID-19 cases were essential for guiding lockdown decisions and resource distribution.
Moreover, data visualization not only facilitates understanding but also collaboration. It allows different stakeholders to discuss data on equal footing, relying on a common and clear representation. In a professional setting, this can enable multidisciplinary teams to quickly converge on well-informed decisions.
To conclude, data visualization is much more than just a
method of presentation ; it is a powerful way to reveal hidden
insights and help decision-makers see invisible patterns in the
numbers. Thanks to quotes and examples from pioneers in the
field, we understand the crucial importance of well-thought-out
visual representation. As Edward Tufte said so well, “A picture
is worth a thousand words, but a good visualization is worth
much more” [Tufte, 2001].
7.3 Design Thinking

Design thinking is an iterative, user-centered approach to solving complex problems, emphasizing empathy, collaboration, and experimentation. This methodology is particularly useful when designing effective and accessible data visualizations. Here are the five phases of design thinking applied to data visualization [Brown, 2009]:

1. Empathy: Understanding the emotions, needs, and expectations of your target audience is essential. This means putting yourself in the users' shoes to determine which types of visualizations will help them understand the data intuitively and meaningfully [Norman, 2013].

2. Definition: Precisely determine the purpose of the visualization. What question do you want your audience to answer with the data? What conclusions do you want the viewer to draw? This phase clarifies the objectives and establishes the key messages to convey. It may include identifying the most relevant variables and determining how they should be visually highlighted.

3. Ideation: This is the creative phase where you explore various approaches and formats for representing your data. By collaborating with your team or looking for inspiring solutions in other sectors, you can generate innovative ideas to visualize data more effectively [IDEO, 2015].

4. Prototyping: Create quick prototypes of the proposed visualizations. These prototypes do not need to be perfect, but they should provide a basis on which you can gather concrete feedback [Brown, 2009].

5. Testing: The final phase involves testing your prototypes with a representative sample of your target audience. Gather feedback on the effectiveness of the visualizations, how easily users can interpret the data, and their visual impact [Norman, 2013].

By integrating user needs throughout the design process, design thinking ensures that visualizations are not only aesthetic but also bring real value by facilitating understanding and data analysis [IDEO, 2015].
7.4 Good and Bad Graphic Representation

Creating effective data visualizations relies on a combination of clarity, audience adaptation, and rigor in representation. A good graphic representation has distinct characteristics that maximize its communicative value [Tufte, 2001].

7.4.1 A good graphic representation is characterized by:

— Clarity and conciseness: A good visualization must have a direct message, avoiding unnecessary information. It should express the data in an understandable way, with clearly defined titles and axes, so that the viewer can easily grasp the main insights [Tufte, 2001].
— Solid underlying data: Every visualization should be based on reliable and compelling data. Using relevant and appropriate graphs helps illustrate trends and relationships in the data. For example, a scatter plot is perfect for showing the relationship between two variables, while a line graph is better suited to representing time series [Few, 2012].
— Adapted to the audience: The design should take the target audience into account. Complex graphs may be suitable for experts, while simpler visuals and explanatory captions are preferable for non-expert audiences [Knaflic, 2015].
— Logical structure: Organize your visualizations logically. This starts with an explicit title, well-labeled axes, and a clear hierarchy in the information [Tufte, 2001].
— Visually appealing: A good graphic representation attracts attention while remaining functional. Use colors, shapes, and sizes strategically to make the analysis visually pleasing [Few, 2012].

7.4.2 On the other hand, a bad graphic representation has the following characteristics:

— Confusing message: If the visualization presents too much contradictory information, or if the visual elements are poorly structured, the message can become difficult to understand [Tufte, 2001].
— Insufficient or incorrect data: A poor presentation can result from selecting the wrong data or making incorrect interpretations [Few, 2012].
— Disregard for the audience: A graphic that is too technical, lacking clear explanations or context, can leave the audience in the dark [Knaflic, 2015].
— Inconsistent structure: A bad graphic can be disorganized, with misplaced titles, poorly labeled axes, or misaligned visual elements [Tufte, 2001].
— Unengaging visuals: Dull, poorly designed, or unnecessary visuals can make a data presentation less captivating and memorable [Few, 2012].
7.5 Visualization Methods

Data visualization is a complex field that requires a thoughtful approach to creating informative and engaging visual elements. Visualizations not only simplify data analysis but also make complex information accessible to a wide audience. Several methods and conceptual frameworks can help you design effective data visualizations. In this section, we explore two of these methods in detail: the McCandless method and Kaiser Fung's Junk Charts Trifecta checkup.

7.5.1 The McCandless Method

The McCandless method, developed by David McCandless, is a four-point framework for creating impactful data visualizations. It rests on the idea that well-chosen data, presented clearly and aesthetically, can convey powerful and memorable stories. Here is a detailed analysis of each element of this method:

— Information (data): The information you are trying to convey is at the core of any data visualization. Carefully choose the relevant data to include, ensuring that it is accurate, current, and suitable for the objective you wish to achieve. Data quality is essential. As [Tufte, 2001] states, the truth in the data must be the highest priority, as any error in the information destroys the credibility of the visualization.
— Story (concept): The story you tell through your visualization is essential to captivating your audience. Develop a clear narrative concept that guides viewers through your data. Use titles, captions, and annotations to explain the context and the conclusions you want to share. According to [Tukey, 1977], effective visualizations are those that lead viewers to understand and discover the hidden story in the data.
— Purpose (function): The purpose of your visualization should be clear and defined. What action or understanding do you want your audience to take away? Design your visual element around this goal, highlighting the most relevant information and eliminating the unnecessary. Visualization should never be an end in itself, but rather a means to clarify and reinforce the message. As [Shneiderman, 1996] writes, a good visualization has a clear purpose and helps in decision-making or in revealing clear insights.
— Visual Form (metaphor): The visual form of your graphic element is crucial to making your data accessible and engaging. Choose the best visual metaphor to represent your data, whether graphs, charts, maps, or other elements. Ensure that the presentation is aesthetically pleasing, organized, and easy to read. As [Tufte, 2001] highlights, the visual form should be guided by the information and never obstruct its understanding with unnecessary decoration.

By integrating these four elements into your creation process, you can develop powerful and persuasive data visualizations that clearly communicate complex ideas while engaging your audience.
7.5.2 Verification of the Junk Charts Trifecta by Kaiser Fung

The "Junk Charts Trifecta checkup" is a conceptual framework proposed by Kaiser Fung, author of "Numbers Rule Your World: The Hidden Influence of Probability and Statistics on Everything You Do" and of the Junk Charts blog. It consists of three essential questions to ask when creating a data visualization, designed to avoid the common mistakes that lead to misleading or useless charts, often called "junk charts".

This approach is useful for evaluating the quality of your visualizations. It is based on three fundamental questions, each of which marks a key step in creating effective visualizations:

1. What is the practical question? Before creating a visualization, clearly define the practical question you want to answer or the information you want to communicate. Identify the needs and expectations of your target audience. As [Fung, 2011] emphasizes, every visualization should begin with a clear question that guides the design and presentation of the data.

2. What do the data say? Ensure that your data is properly analyzed and interpreted to answer the practical question. Apply appropriate statistical techniques and make sure the data is presented accurately and transparently. Visualizations should be based on rigorous analyses, not on subjective interpretations (cf. [Tufte, 2001]).

3. What does the visual element say? Examine how effectively the visual element you have created communicates the data to viewers. Ensure that it supports understanding of the practical question and the data through thoughtful choices of color, layout, and design. As [Fung, 2011] suggests, the visual element must facilitate decision-making by clarifying the key points in the data.

By thoroughly answering these questions, you can assess the effectiveness of your data visualization from the perspective of your target audience. These methods, combined with thoughtful reflection on your goals and conclusions, will help you create compelling and informative data visualizations.
7.6 Exploitation of Pre-Attentive Attributes in Data Visualization

Creating effective visual elements relies on a good understanding of human cognitive processing. Pre-attentive attributes, the visual characteristics that the human brain identifies instantly and without conscious effort, are crucial for making data quickly understandable. These attributes help attract attention and organize information in an intuitive way, thereby facilitating quick reading and decision-making. The two main factors influencing visual effectiveness are marks and channels [Cleveland, 2001, Tufte, 1990, Bertin, 2010].

7.6.1 Visual Marks

Marks are the basic elements used to represent data in a visualization. They carry four essential visual qualities that help convey information:

— Position: The position of elements in space plays a key role in organizing information. For example, data positioned higher on a graph generally indicates a higher value [Tufte, 1990].
— Size: The size of visual elements, whether points, bars, or lines, is used to express quantitative differences. Larger or smaller values can be easily identified by changes in the size of the marks [Bertin, 2010].
— Shape: Distinct shapes allow for the visual categorization of data. For instance, circles may represent one category and squares another, making it easier to distinguish between different types of information [Cleveland, 2001].
— Color: Color is a powerful attribute that helps differentiate elements or highlight particular values. It can guide the viewer to points of interest by drawing attention to the most important elements [Tufte, 1990].

7.6.2 Visual Channels

Channels are the visual variables through which data is represented. Each channel's effectiveness in communicating data depends on specific criteria:

— Precision: Some channels allow for a more precise reading of data than others. For example, using lengths or positions on a continuous scale makes it possible to perceive fine differences between values, which is particularly important for numerical data [Bertin, 2010].
— Distinctness: Certain visual channels make it easier to distinguish between different values. Variations in color or size, for instance, can make clear differences between data easier to spot [Tufte, 1990].
— Grouping: Channels can also be used to show relationships or groupings in the data. Distinct shapes or colors can visualize categories or data clusters, which is essential for identifying trends or significant groups within the data [Cleveland, 2001].

By understanding how each mark and visual channel affects data perception, it becomes possible to design data visualizations that not only grab attention but also facilitate rapid analysis and decision-making. The judicious use of these pre-attentive attributes transforms complex data sets into clear and accessible representations, promoting instant understanding.
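As an illustration (the data here are simulated, not drawn from the text), the following matplotlib sketch encodes four pre-attentive attributes at once: position for two variables, size for a magnitude, and shape and color for a category.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, suitable for scripts
import matplotlib.pyplot as plt
import numpy as np

# Simulated observations in two categories
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 40)         # channel: horizontal position
y = rng.uniform(0, 10, 40)         # channel: vertical position
value = rng.uniform(20, 200, 40)   # channel: size (a quantitative magnitude)
category = rng.integers(0, 2, 40)  # channel: shape and color (a category)

fig, ax = plt.subplots(figsize=(6, 4))
for cat, marker, color in [(0, "o", "tab:blue"), (1, "s", "tab:orange")]:
    mask = category == cat
    ax.scatter(x[mask], y[mask], s=value[mask], marker=marker,
               color=color, alpha=0.7, label=f"Category {cat}")
ax.set_title("Pre-attentive attributes: position, size, shape, color")
ax.set_xlabel("Variable X")
ax.set_ylabel("Variable Y")
ax.legend()
fig.savefig("preattentive_attributes.png")
```

The viewer separates the two categories at a glance from shape and color alone, before reading any label, which is precisely the pre-attentive effect described above.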
7.7 Geomatics and Spatial Data Representation

Geomatics, which encompasses the collection, processing, and analysis of geographic data, relies heavily on data visualization to aid decision-making. One of the most obvious applications of data visualization in this field is mapping, which makes complex information accessible and understandable to a wide audience [et al., 2005].

Spatial representations, whether maps or 3D visualizations, allow geographic data to be contextualized by placing it within a real-world space. This offers a perspective that is not only informative but also strategic. For example, in the field of natural resource management, maps of temperature or precipitation over long periods help identify climate trends and predict impacts on local ecosystems [Kraak, 2011].

7.7.1 The Importance of Spatial Visualization

Visualizing geospatial data has several advantages. First, it allows relationships between different elements of an environment to be represented intuitively. A simple map can reveal patterns or anomalies that raw data tables would be unable to show. For example, in urban planning, heat maps of population density can help identify areas underserved by public infrastructure or green spaces, facilitating more effective urban planning [TomTom, 2015].

Geographic Information Systems (GIS) allow different layers of geospatial data, such as roads, buildings, residential areas, and natural environments, to be integrated and overlaid. Visualizing this information in three-dimensional space provides a richer understanding of spatial dynamics and of the impact of political or economic decisions [et al., 2005].

7.7.2 Practical Applications of Spatial Visualization

Geomatics applications are vast, ranging from disaster management to transportation planning to precision agriculture. For example, visualizing the most trafficked roads in a city on an interactive map enables local authorities to better understand traffic flow and plan transportation infrastructure accordingly [Kraak, 2011]. Similarly, farmers using geospatial data can visualize soil moisture or crop health on maps to make informed decisions on irrigation and pesticide use [TomTom, 2015].

Spatial data representation, through mapping and geospatial analysis, thus serves as a crucial tool for understanding, interpreting, and making informed decisions in complex, multi-dimensional contexts.
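By way of illustration, here is a minimal, self-contained sketch of a density heat map. The coordinates are simulated, and a real geomatics project would more likely use a GIS tool or an interactive mapping library such as folium; this sketch only shows the principle of encoding point density as cell color.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

# Simulated GPS points clustered around a hypothetical city center
rng = np.random.default_rng(0)
lat = 48.85 + 0.02 * rng.standard_normal(500)
lon = 2.35 + 0.03 * rng.standard_normal(500)

fig, ax = plt.subplots(figsize=(6, 5))
# hist2d bins the points into a grid; the color of each cell encodes point density
counts, _, _, image = ax.hist2d(lon, lat, bins=25, cmap="hot")
fig.colorbar(image, ax=ax, label="Points per cell")
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.set_title("Density of simulated observations")
fig.savefig("density_map.png")
```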
7.8 Design Principles

When creating data visualizations, it is essential to follow certain design principles to ensure effectiveness, clarity, and visual engagement. These principles help ensure that the information is easily understandable and that the viewer's attention is appropriately guided. Here are the fundamental design principles to keep in mind when creating visualizations:

Balance: A balanced design ensures a harmonious distribution of visual elements such as color, shape, and graph size. This does not necessarily mean the elements must be symmetrical, but rather that no part of the visualization should dominate excessively. A well-designed balance helps keep attention on the visualization as a whole without causing distractions. For example, distributing colors evenly among the elements of a graph prevents any one element from overly capturing the viewer's attention.

Emphasis: Emphasis involves drawing attention to the most relevant and important data using visual elements such as contrasting colors or larger shapes. This allows the viewer to quickly identify key information. For example, in a sales graph, you can highlight the best-performing category with a bright color while showing the other categories in more neutral shades. This visual hierarchy guides viewers and lets them focus on what matters most.

Movement: Movement in a visualization refers to how the viewer's eye travels through the visual elements. Use guides, arrows, or smooth transitions to direct the viewer's attention from one point to another. Movement can also be symbolic, showing a trend in the data, such as an upward trend line that draws the eye toward progress over time. This principle is crucial for effectively guiding the reading of the visualization, especially in interactive or animated visualizations.

Pattern: Create visual patterns using similar shapes, sizes, or colors to signal relationships or categories. This reinforces the structure of the visualization and makes comparison between elements more intuitive. For instance, in a bar chart, using similar colors for each category helps maintain visual continuity, making the chart easier for the viewer to understand. Conversely, a deliberate break in the pattern, such as a distinct color for a particular category, can highlight an important point or anomaly.

Repetition: Repeating certain visual elements such as colors, shapes, or graph types strengthens the organization and readability of the data. It creates coherence in the visualization, helping the viewer become familiar with the structure of the graph and easily grasp underlying patterns. For example, if several graphs show time series, using the same format or color for the trend lines in each graph strengthens comparability. Repetition also creates a visual rhythm, making the visualization less cluttered and easier to follow.
7.9 Titles and Labels

One of the fundamental aspects of making a data visualization effective is the use of appropriate titles and labels. Though seemingly simple, these elements play a crucial role in how the information is perceived by the viewer. They facilitate quick understanding of the data and provide the essential contextual framing for a correct interpretation of the results presented.

Title: Typically placed at the top of the visualization, the title provides a brief overview of the subject being addressed. It should be clear, concise, and precise. A good title does not just describe what is being represented; it also guides the viewer's attention to the main purpose of the visualization. For example, in a global temperature analysis, a title like "Global Temperature Trends from 1900 to 2024" immediately tells the audience the type of data and the time period being analyzed.

Labels: Labels are key to making the data accessible and understandable. They are particularly important for describing the axes (X and Y) of graphs or for identifying categories in pie charts. A well-placed label helps the viewer understand what each point or segment of the visualization represents. For example, in a bar chart showing sales by region, clear axis labels reading "Sales (in thousands)" and "Region" provide immediate context and improve readability.

Legend: A legend, or key, is often used to explain the meaning of different visual elements, such as colors, shapes, or sizes. For example, in a population density map, each color might represent a range of values. Although the legend is essential for interpreting the data, it is important to position it wisely and make it easily identifiable, so that the viewer does not have to move back and forth between the legend and the visualization.

Choice between Labels and Legend: The choice between direct labeling and a legend depends on the complexity of the visualization and the need for clarity. Labels are ideal for simple graphs, while legends are preferable in complex visualizations where multiple categories or variables need to be explained.
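To tie these elements together, here is a short matplotlib sketch with hypothetical sales figures, combining a descriptive title, axis labels with units, and a legend:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

# Hypothetical sales figures (in thousands) for four regions, two years
regions = ["North", "South", "East", "West"]
sales_2023 = [120, 95, 140, 80]
sales_2024 = [135, 100, 150, 90]

fig, ax = plt.subplots(figsize=(6, 4))
x = range(len(regions))
ax.bar([i - 0.2 for i in x], sales_2023, width=0.4, label="2023")
ax.bar([i + 0.2 for i in x], sales_2024, width=0.4, label="2024")
ax.set_title("Sales by Region, 2023 vs 2024")  # title: subject and scope
ax.set_xlabel("Region")                        # axis labels give context
ax.set_ylabel("Sales (in thousands)")          # units prevent misreading
ax.set_xticks(list(x))
ax.set_xticklabels(regions)
ax.legend(title="Year")                        # legend explains the two colors
fig.savefig("sales_by_region.png")
```

Here direct tick labels identify the regions (a simple, small set), while the legend handles the two series; this reflects the labels-versus-legend trade-off described above.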
7.10 Relevance of Visualization

The relevance of a visualization lies in its ability to appropriately reflect the patterns or information you wish to communicate. A relevant visualization is one that supports meaningful conclusions while making the data accessible and understandable to the target audience.

The types of patterns to visualize may vary, but each must serve a specific goal and the precise message you want to convey. Here are some common patterns in data visualization:

Change: Change refers to the evolution of data over time. Line graphs or bar charts are ideal for illustrating temporal trends, such as the evolution of product sales over months or years. These visualizations allow changes and fluctuations in the data to be grasped quickly.

Grouping: Grouping is used when there are distinct groups or categories of data. Histograms or box plots are perfect for showing the distribution or dispersion of data within different categories, such as student performance across schools or income across population segments.

Proportionality: Proportionality involves comparing the parts of a whole. Pie charts are frequently used to represent market share distribution or the demographic composition of a population. However, pie charts can be confusing if too many segments are presented or if the differences between them are subtle.

Ranking: Ranking is often used to represent hierarchical or ordered data. Column charts or bar charts are perfect for comparing items based on their rank, such as companies ranked by revenue or countries ranked by population.

Correlation: Correlation shows the relationships between two or more variables. Scatter plots are ideal for observing whether two variables are related, such as the association between temperature and energy consumption. These visualizations help detect linear or nonlinear trends, as well as potential anomalies in the data.

The choice of visualization type depends on several factors, including the nature of the data, the communication objective, and the desired level of complexity. A well-chosen visualization not only facilitates understanding of the data but also enables quick identification of key information, which is essential for making informed decisions. It is therefore crucial to select the visual method that best highlights the patterns or insights relevant to your target audience.
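As an illustration of the correlation pattern, the sketch below uses simulated temperature and consumption data (a real analysis would of course use observed values) and adds a fitted trend line to make the direction of the relationship explicit:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

# Simulated data: daily temperature vs heating energy consumption
rng = np.random.default_rng(1)
temperature = rng.uniform(-5, 35, 100)
consumption = 50 - 0.8 * temperature + rng.normal(0, 4, 100)  # built-in negative link

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(temperature, consumption, alpha=0.6)
# A least-squares line summarizes the trend the scatter suggests
slope, intercept = np.polyfit(temperature, consumption, 1)
xs = np.linspace(temperature.min(), temperature.max(), 2)
ax.plot(xs, slope * xs + intercept, color="red", label=f"Trend (slope {slope:.2f})")
ax.set_title("Temperature vs Energy Consumption (simulated)")
ax.set_xlabel("Temperature (°C)")
ax.set_ylabel("Consumption (kWh)")
ax.legend()
fig.savefig("correlation.png")
```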
7.11 Creating an Effective Dashboard

In data analysis, a dashboard is an essential tool for tracking key performance indicators (KPIs) and metrics. It provides a concise visual overview of the performance of an organization, operational unit, or specific process. Design thinking, in particular, is a fundamental approach for creating visualizations that meet users' needs, ensuring that the data is not only accurate but also understandable and useful. As Tim Brown explains, "Design thinking focuses on empathy, collaboration, and experimentation—principles that are particularly valuable when it comes to creating data interfaces" [Brown, 2009].

Here are the principles for creating an effective dashboard:

— Clear objective: Precisely define the goal of the dashboard and the KPIs you want to track. This could include sales performance, customer satisfaction, or the efficiency of an operational process.
— Reliable data sources: Identify and collect relevant data, ensuring that it is reliable, up to date, and representative. Make sure it is also accessible for real-time use.
— Intuitive design: Design an interface that allows users to quickly understand the data and make informed decisions. This requires selecting the right visualizations (charts, tables, diagrams) and arranging them in a consistent and logical manner.
— Real-time updates: Where possible, set up the dashboard to display real-time information, supporting quick and effective decision-making.
— Customization: Offer the ability to customize the dashboard to meet the specific needs of each user. A flexible dashboard lets every user access the information most relevant to them.
— Collaboration: Facilitate data sharing and collaboration by allowing users to comment, analyze, and discuss directly on the dashboard.

Well-designed dashboards improve decision-making, provide real-time tracking, and foster enhanced collaboration among stakeholders.
7.12 Choosing the Right Chart

Choosing the right chart is a fundamental aspect of data representation. Indeed, a poor chart selection can lead to misinterpretation or to the loss of important information. Here are some recommendations based on data visualization principles.

7.12.1 Histogram and Density Plot

A histogram is often the best choice for representing the distribution of a single continuous numeric variable. It visualizes the frequency of values across different ranges. The density plot, on the other hand, is smoother and highlights general trends by smoothing the data distribution [Tufte, 2001].
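To illustrate the contrast, this sketch draws both views of the same simulated variable. The kernel density estimate is computed by hand with NumPy here; pandas' plot(kind="density") produces a similar curve using scipy under the hood.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Simulated continuous variable: trip durations in minutes
rng = np.random.default_rng(7)
durations = pd.Series(rng.gamma(shape=2.0, scale=8.0, size=1000))

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: raw frequency of values in each bin
durations.plot(kind="hist", bins=30, ax=axes[0], color="skyblue", edgecolor="black")
axes[0].set_title("Histogram")
axes[0].set_xlabel("Duration (minutes)")

# Density estimate: the same distribution, smoothed with Gaussian kernels
grid = np.linspace(0, durations.max(), 200)
bw = durations.std() * (4 / (3 * len(durations))) ** 0.2  # Silverman's rule of thumb
kde = np.exp(-0.5 * ((grid[:, None] - durations.values[None, :]) / bw) ** 2).sum(axis=1)
kde /= len(durations) * bw * np.sqrt(2 * np.pi)
axes[1].plot(grid, kde, color="darkblue")
axes[1].set_title("Density Plot")
axes[1].set_xlabel("Duration (minutes)")

fig.tight_layout()
fig.savefig("hist_vs_density.png")
```

The histogram exposes bin-level detail (and bin-choice artifacts), while the density curve emphasizes the overall shape of the distribution.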
7.12.2 Bar Chart

Bar charts are particularly suited to comparing discrete categories or distinct groups. This type of chart is useful when you want to observe differences between data categories, such as sales by region or performance by department.

7.12.3 Line Chart and Pie Chart

Line charts are ideal for representing trends over time, especially when there are multiple data sets. They connect successive data points, and several series can share the same axes so that the evolution of their values can be compared. The pie chart, on the other hand, is effective for illustrating the composition of a data set and showing the proportions between different categories [Few, 2012].
7.12.4 Best Practices

To ensure the effectiveness of a chart, it is crucial to follow certain best practices:

— Clarity: Every chart should communicate a clear and precise message. Avoid unnecessary decorative elements and ensure that the axes are well labeled.
— Relevance: Choose the type of chart based on the objective. For example, a bar chart is more suitable for categorical comparisons, while a line chart is more effective for temporal trends.
— Simplicity: Do not overload your charts with too much data. A simple chart allows for better understanding and has a greater impact.
7.13 The Art of the Slide Show

Creating an effective slide show relies on careful planning and good use of visuals. The key principle is to make the information clear and easily understandable. Design thinking is also a useful approach in this phase, as it emphasizes clarity and adaptability to the needs of the target audience. Here are some tips for a successful presentation:

— Defined Objective: Before creating your slide show, clarify your objective: to inform, persuade, or entertain. This will guide the structure of your presentation [Knaflic, 2015].
— Consistent Theme: Choose a theme that supports your message and aligns with the tone of your presentation. Visual consistency enhances the impact of your message.
— Visual Simplicity: Limit the text and information on each slide. Use high-quality images and relevant graphics to support your points without overwhelming the audience [Tufte, 2001].
— Practice: Rehearse your presentation several times to ensure it flows smoothly and is well structured.

By following these principles, you will create a captivating and effective slide show that conveys your message with impact.
7.14 Presenting Results

When a data analyst shares their results, clarity and communication are essential to ensure that the information is not only understandable but also relevant to the audience. A good presentation of results relies on effective visuals and appropriate communication. Here are tips for presenting results successfully.

7.14.1 Tips for a Successful Presentation

— Understand Your Audience: Adjust the complexity of your presentation to the expertise level of your audience. For example, a technical audience may be comfortable with complex terms, while a non-expert audience will need simpler explanations.
— Use of Visuals: Well-designed graphs and tables facilitate understanding of the results and highlight the key points [Knaflic, 2015].
— Avoid Jargon: Use simple language and avoid technical terms that may obscure the clarity of your message.
— Logical Structure: Present results in a structured way: start with a clear introduction, follow with the data, and conclude with the key observations and recommendations.

7.14.2 Anticipating Objections

When presenting your results, objections are likely to arise, whether regarding methodology, data, or interpretation. To manage these objections effectively, here are some steps to follow:
1. Anticipate Objections: Before the presentation, identify potential objections by considering the weaknesses in your data and the limitations of your methodology.

2. Prepare Strong Responses: Prepare well-argued responses, supported by concrete data and relevant examples.

3. Stay Calm and Professional: When faced with an objection, listen attentively and respond respectfully without becoming defensive.

4. Use Visualizations: Use charts to illustrate your responses and clarify your points.

5. Acknowledge Limitations: If an objection highlights a valid limitation in your analysis, acknowledge it and explain how you handled it.

6. Encourage Discussion: If an objection leads to a productive discussion, encourage it. This can enrich your conclusions.
7.14.3 Anticipating Questions

In addition to objections, audience questions are an important aspect of the presentation. Here are strategies for answering them effectively:

1. Think Like the Audience: Put yourself in your audience's shoes to anticipate their concerns.

2. Prepare Answers: Prepare answers to the most likely questions, based on solid data.

3. Admit Uncertainty: If a question stumps you, do not hesitate to admit that you do not know, and commit to looking up the information afterwards if necessary.

4. Encourage Questions: Invite the audience to ask questions after the presentation, showing that you are open to discussion.

5. Manage Time: If time is limited, answer the essential questions first and offer to discuss the details after the presentation.

6. Stay Focused on the Objective: Ensure your answers help clarify the results and stay focused on the main topic.
7.14.4 Citations and Failure Cases

The importance of managing result presentations well is highlighted by several studies and examples in which poorly handled objections or insufficient answers led to failure. For example, during a presentation of results from a market data analysis, the analyst neglected to respond adequately to questions about the methodology used to collect the data. This led to a loss of trust from the audience and undermined the credibility of the analysis [Meyer, 2014]. Another notable case is that of a technology company which, while presenting an interactive dashboard to a management team, underestimated the importance of clearly explaining the limitations of real-time data. The lack of transparency raised doubts about the reliability of the information, leading to slower adoption of the system [Thompson, 2020].
7.15 Example of the Bike Rental Project
The goal of this presentation is to share the results from
the analysis of electric bike data. We studied several aspects
of bike usage, such as trip durations, geographic distribution
of departures, and peak hours. This information is essential for
optimizing fleet management.
7.16 Analysis Code
Here is the Python code used to generate the charts from the
data analysis.
Listing 7.1 – Loading Libraries and Data
import pandas as pd
import matplotlib.pyplot as plt
import folium
from folium.plugins import HeatMap

# Loading data (ensure your DataFrame 'df' contains the correct columns)
df = pd.read_csv('bike_data.csv')
7.17 Distribution of Bike Types
To visualize the distribution of bike types used, we generated
a bar chart showing the popularity of electric bikes compared to
standard bikes. The Python code below generates this chart.
Listing 7.2 – Distribution of Bike Types
plt.figure(figsize=(8, 6))
df['rideable_type'].value_counts().plot(kind='bar', color='skyblue')
plt.title('Distribution of Bike Types', fontsize=16)
plt.xlabel('Bike Type', fontsize=12)
plt.ylabel('Number of Rides', fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
7.18 Distribution of Trip Durations
We calculated trip durations in minutes by subtracting the
start timestamps from the end timestamps, then visualized this
distribution in the form of a histogram. The Python code for this is as follows.
Listing 7.3 – Distribution of Trip Durations
df['started_at'] = pd.to_datetime(df['started_at'])
df['ended_at'] = pd.to_datetime(df['ended_at'])
df['ride_duration'] = (df['ended_at'] - df['started_at']).dt.total_seconds() / 60  # Convert to minutes
plt.figure(figsize=(8, 6))
df['ride_duration'].plot(kind='hist', bins=20, color='lightgreen', edgecolor='black')
plt.title('Distribution of Trip Durations', fontsize=16)
plt.xlabel('Duration (minutes)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.tight_layout()
plt.show()
7.19 Number of Rides by Member Type
This graph shows how many rides were taken by each type
of member (regular or casual). Below is the Python code to
generate this graph.
Listing 7.4 – Number of Rides by Member Type
plt.figure(figsize=(8, 6))
df['member_casual'].value_counts().plot(kind='bar', color='salmon')
plt.title('Number of Rides by Member Type', fontsize=16)
plt.xlabel('Member Type', fontsize=12)
plt.ylabel('Number of Rides', fontsize=12)
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()
7.20 Number of Rides by Time of Day
By analyzing the rides based on the time of day, we can
determine the peak activity periods. Here is the code for that.
Listing 7.5 – Number of Rides by Time of Day
df['hour'] = df['started_at'].dt.hour
plt.figure(figsize=(8, 6))
df.groupby('hour').size().plot(kind='line', marker='o', color='orange')
plt.title('Number of Rides by Time of Day', fontsize=16)
plt.xlabel('Time of Day', fontsize=12)
plt.ylabel('Number of Rides', fontsize=12)
plt.grid(True)
plt.tight_layout()
plt.show()
7.21 Top 10 Start Stations
We analyzed the most popular start stations by creating a
bar graph of the 10 most frequent stations. Below is the asso-
ciated code.
Listing 7.6 – Top 10 Start Stations
plt.figure(figsize=(8, 6))
df['start_station_name'].value_counts().head(10).plot(kind='bar', color='lightcoral')
plt.title('Top 10 Start Stations', fontsize=16)
plt.xlabel('Station Name', fontsize=12)
plt.ylabel('Number of Rides', fontsize=12)
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
7.22 Heatmap of Rides
Finally, to visualize the areas where rides start, we used a
heatmap generated with the folium library. This code creates
an interactive map showing areas with high ride density.
Listing 7.7 – Heatmap of Rides
m = folium.Map(location=[df['start_lat'].mean(), df['start_lng'].mean()], zoom_start=12)
locations = df[['start_lat', 'start_lng']].dropna().values
HeatMap(locations).add_to(m)
m.save('heatmap_trajets.html')  # Save in HTML format for visualization
The analysis of electric bike rides has highlighted interesting
trends regarding bike usage, member types, and the most popu-
lar stations. The results presented here can help optimize bike
and station management to better meet demand.
Chapter 8
Data Analysis and AI
8.1 Why Do We Analyze Data ? What We
Are Looking For and How to See Beyond
the Numbers
Data analysis is not limited to simply extracting numbers
or facts. Its aim is to extract valuable information, understand
phenomena, and sometimes predict events or prevent future pro-
blems. Data is the raw material from which we hope to discover
trends, anomalies, and hidden relationships that can guide our
decisions and strategies.
8.1.1 The Goal of Data Analysis
Data analysts primarily seek to answer specific ques-
tions or solve problems by exploring data. According to
[Davenport and Harris, 2007b], data analysis is a way to "transform
data into knowledge" which can then be used to make informed
decisions. This can include :
— Forecasting future outcomes (such as sales or customer behavior),
— Understanding the root causes of a problem (such as a low conversion rate on a website),
— Detecting patterns or trends that can improve an organization’s strategy.
But the key is not just "seeing the data," it is "seeing beyond."
As emphasized by [Mayer-Schönberger and Cukier, 2013] in
their book *Big Data : A Revolution That Will Transform How
We Live, Work, and Think*, data analysis provides a more ac-
curate and complete view of past events, but also a perspective
on the future, sometimes with impressive precision.
8.1.2 Can We See Beyond the Data ?
One of the most ambitious goals of data analysis is to "see
beyond," meaning to anticipate the future or identify problems
before they occur. This ability relies on advanced analytical tech-
niques such as machine learning, neural networks, or predictive
analytics.
According to [Hubbard, 2014], the data analyst seeks to "le-
verage data to make predictions about future events" and the-
reby "reduce uncertainty" based on the available information.
For example, in healthcare, predictive models can identify di-
sease risks before symptoms manifest, offering opportunities for
preventive treatment.
This is also applicable in other fields like finance, where
AI can forecast economic crises or market fluctuations
[LeCun, 2015]. Historical data combined with learning models
can thus provide quite precise forecasts, sometimes to the point
of "seeing" potential changes or risks ahead.
8.1.3 Can Data Analysis Prevent a Problem ?
Anticipation is one of the primary reasons businesses hea-
vily invest in data analysis. For example, analyzing consumer
behavior on online platforms can help predict potential churn or
customer dissatisfaction and take action accordingly, even before
the customer makes a decision [Fader et al., 2016].
Furthermore, data analysis also enables real-time anomaly
detection and immediate corrective actions. Tools like fraud de-
tection systems use AI to identify unusual transactions that may
indicate fraudulent activities, thereby preventing financial losses
before they occur.
In conclusion, data analysis goes far beyond simple num-
ber collection : it transforms these numbers into strategic in-
formation, identifies behavioral patterns, and enables informed
decision-making. With predictive capabilities and proactive pro-
blem detection, data analysts not only have the ability to better
understand past events but also anticipate and prevent certain
risks.
8.2 Artificial Intelligence in Data Analysis
Artificial intelligence (AI) is a field of computer science that
aims to create systems capable of simulating human intelligence,
particularly the ability to learn, reason, and solve problems.
Data analysis, on the other hand, is the process of extracting
knowledge and understanding from data, whether small or mas-
sive. AI plays a crucial role in this by automating, optimizing,
and enriching the work of data analysts.
8.2.1 Data Collection
AI is used to collect data from various sources, whether struc-
tured or unstructured. This includes data from sensors, social
media streams, online databases, and much more. AI also allows
the extraction of unstructured data (such as images, videos, or
texts) and transforms them into actionable data, providing a
wealth of information for analysis.
For example, AI-powered chatbots can interact with users
to gather valuable information and qualify leads, making data
collection more dynamic and fluid.
8.2.2 Data Preprocessing
Before analysis, data often needs to be cleaned and prepa-
red. AI, particularly machine learning, can automate these often
complex tasks. For example, algorithms can be used to :
— Identify and correct outliers,
— Fill in missing data using predictive models,
— Normalize and transform data to make it consistent and ready for use.
These techniques allow data analysts to focus on the strategic
aspects of analysis by delegating repetitive tasks to automated
systems.
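As a small sketch of these preprocessing steps, the following example uses pandas on an invented sensor column : the interquartile-range rule flags an outlier, the median fills missing values (a simple stand-in for predictive imputation), and the result is normalized.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor column with one missing value and one outlier
df = pd.DataFrame({"temperature": [20.1, 19.8, np.nan, 21.0, 95.0, 20.4]})

# 1. Flag outliers with the interquartile-range (IQR) rule and blank them out
q1, q3 = df["temperature"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = (df["temperature"] < q1 - 1.5 * iqr) | (df["temperature"] > q3 + 1.5 * iqr)
df.loc[outliers, "temperature"] = np.nan

# 2. Fill missing values with the median (a simple stand-in for predictive imputation)
df["temperature"] = df["temperature"].fillna(df["temperature"].median())

# 3. Normalize to zero mean and unit variance
df["temperature_norm"] = (df["temperature"] - df["temperature"].mean()) / df["temperature"].std()

print(df)
```

In a real pipeline, the median fill would typically be replaced by a trained imputation model, but the structure of the steps stays the same.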
8.2.3 Data Analysis
One of AI’s major contributions to data analysis lies in its
ability to identify complex patterns, discover non-linear rela-
tionships, and make predictions with great accuracy. Through
supervised and unsupervised learning, AI can create models that
detect hidden trends in datasets, allowing for deeper insights.
AI algorithms can, for example :
— Perform predictive analysis (e.g., predicting future sales based on historical trends),
— Segment customers based on purchase behaviors or demographic data,
— Detect anomalies or fraud in transactional data.
By using deep neural networks or decision trees, AI enhances
the ability to explore large volumes of data and draw conclusions
more quickly.
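The customer segmentation idea can be sketched with unsupervised learning ; in this minimal example the customers, their features, and the two segments are all invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers described by (annual spend, purchase frequency)
customers = np.array([
    [200, 2], [220, 3], [250, 2],        # occasional low spenders
    [1500, 20], [1600, 22], [1450, 19],  # frequent high spenders
])

# Group the customers into two behavioral segments
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
segments = kmeans.fit_predict(customers)

print(segments)  # one segment label per customer
```

The algorithm recovers the two groups without ever being told which customer belongs where, which is exactly the "hidden trends" point made above.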
8.2.4 Data Visualization
Data visualization is a key element in effectively commu-
nicating the results of analyses. AI plays an important role in
creating interactive and dynamic visualizations, making it easier
to understand the relationships between variables and convey
results intuitively.
For instance, AI-powered tools can recommend the most re-
levant types of visualizations (correlation diagrams, heatmaps,
etc.) based on available data. AI can also generate 3D or in-
teractive visualizations to explore multidimensional data more
deeply.
Smart dashboards and interactive charts allow for faster and
more informed decision-making, as stakeholders can explore data
in real-time.
8.2.5 Prediction and Modeling
One of AI’s most important contributions to data analysis
is its ability to make predictions about future events based on
data models. Predictive modeling algorithms, such as regres-
sions, random forests, or neural networks, allow data analysts
to predict outcomes with increased accuracy.
Applications of these models are vast and include :
— Forecasting sales or demand for efficient inventory mana-
gement,
— Estimating customer churn to optimize loyalty strategies,
— Detecting financial risks or fraud through anomaly detec-
tion models.
AI enhances data analysts’ capabilities by helping them
choose and refine the most suitable models for each use case.
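The anomaly-detection application can be sketched as follows, assuming scikit-learn is available ; the transaction amounts are simulated, and an isolation forest flags the one extreme value.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Simulated transaction amounts: mostly everyday purchases, one extreme value
rng = np.random.default_rng(42)
normal = rng.normal(loc=50, scale=10, size=(100, 1))
transactions = np.vstack([normal, [[5000.0]]])  # the last row is suspicious

# An isolation forest scores how easy each point is to isolate;
# fit_predict returns -1 for anomalies and 1 for normal points
detector = IsolationForest(contamination=0.01, random_state=0)
labels = detector.fit_predict(transactions)

print("flagged indices:", np.where(labels == -1)[0])
```

A production fraud system would of course use many more features than a single amount, but the mechanism is the same.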
8.3 Artificial Intelligence : Product of Data
or Generator of Knowledge ?
Artificial intelligence (AI) is often seen as an innovative tech-
nology that can perform complex tasks such as speech recogni-
tion, computer vision, or autonomous decision-making. Howe-
ver, behind this technological façade lies a deep and intimate
connection between data and AI. It is within the data that AI
takes root, forms, and evolves. AI, in its modern form, is thus a
product of the data it processes, learns from, and exploits. This
section aims to explore this connection from both a philosophical
and scientific perspective.
8.3.1 The Fuel of Artificial Intelligence : Data
Artificial intelligence, particularly in machine learning, feeds
on data. These are not merely objects of measurement but vital
elements that allow AI systems to learn and improve. Every AI
model, whether based on deep neural networks or decision trees,
requires vast amounts of data to be trained.
As [Goodfellow et al., 2016] emphasizes, deep learning algo-
rithms depend on massive data to adjust the weights of neural
networks and optimize their ability to make predictions. AI is,
in a sense, a "product of data" that evolves based on the infor-
mation it receives. This dynamic mirrors the idea that human
intelligence also develops from accumulated experience.
8.3.2 AI as an Extension of Data Analysis
Beyond the technical aspect, a significant philosophical point
is the idea that AI is an extension of human capacity to process
data. In this sense, AI represents an amplification of human
analytical capabilities. Rather than simply processing raw infor-
mation, AI transforms it into structured knowledge, often richer
and more complex than what a human analyst might grasp.
However, this raises a central question : to what extent can
AI, which depends on data for its construction, offer a truly
"objective" view of reality ? As [O’Neil, 2016] suggests, biases
present in the data can be reflected in AI models, leading to
unjust or erroneous results. This highlights the complex rela-
tionship between data, intelligence, and objectivity. If the data
is imperfect, AI, in processing it, risks reproducing those imper-
fections.
8.3.3 The Philosophy of AI and Data : The Question
of Autonomy and Creation
AI is sometimes seen as an entity capable of thinking and
acting independently of humans, but this view requires nuance.
[Boden, 2016], in *AI : Its Nature and Future*, explores this
idea, noting that while AI can learn autonomously, it is always
the product of the data it was trained on. In other words, AI is
inseparable from the data that fueled it, and its "creativity" (if
we can call it that) remains limited by the information available.
Thus, the question arises : can AI truly "see beyond" data,
or is it always confined within a framework determined by that
data ? An answer to this question may come from examining AI’s
ethics. According to [Binns, 2018], one of the major challenges
for AI lies in its ability to handle new and unexpected situations
not covered by past data. This raises a dilemma : if AI merely
"recycles" data, can it ever be truly autonomous or innovative ?
8.3.4 An Intelligence Based on Accumulation
Far from being an independent entity, artificial intelligence is
inextricably linked to all the data it consumes and transforms. It
is an iterative process where each AI model evolves from histori-
cal data, its structure, and its biases. Just as human intelligence
builds from experience and learning, AI depends on data to ac-
quire "intelligence" and face real-world challenges.
Ultimately, the relationship between AI and data could be
seen as a loop where AI, by learning from data, becomes more
efficient but also potentially more biased or limited by the im-
perfections in that data. This reflection opens the door to new
questions about the ethics and future of artificial intelligence,
inviting us to consider how we can better frame data use so that
AI truly serves human interests.
8.4 Applications of AI in Data Analysis
The fields of application for AI in data analysis are numerous
and varied. Here are some examples :
— Predictive Analysis : Using predictive models to esti-
mate future events, such as sales, demand, or customer
behavior.
— Sentiment Analysis : Exploiting textual data to un-
derstand opinions and emotions expressed in reviews, so-
cial media discussions, etc.
— Supply Chain Optimization : Applying algorithms to
predict stock needs and optimize logistics.
— Image and Video Recognition : Processing visual
data to extract useful information, such as facial recog-
nition or medical image analysis.
— Chatbots and Intelligent Assistants : Automating
interactions with customers and gathering data through
AI-powered conversational interfaces.
These applications show that artificial intelligence is not just
about analysis ; it plays a crucial role in transforming raw data
into actionable information, enabling organizations to make in-
formed decisions and better anticipate future trends.
8.5 Application of Artificial Intelligence to
Analysis : The Importance of Predictive
Analysis
Artificial intelligence (AI) has revolutionized data analysis
by expanding knowledge extraction capabilities and making it
possible to anticipate future events. Among its many applica-
tions, predictive analysis stands out as a key area, allowing or-
ganizations to transform data into strategic tools. This section
explores the role of AI in predictive analysis, highlighting its
benefits, limitations, and implications.
8.5.1 What is Predictive Analytics ?
Predictive analytics uses statistical techniques, machine lear-
ning algorithms, and mathematical models to identify trends
and predict future outcomes. According to [Shmueli, 2010], the
goal is to move from describing historical data to making fore-
casts that assist in making informed decisions. Unlike descriptive
analytics, which focuses on the past, predictive analytics seeks
to explore future scenarios based on past and current data.
8.5.2 The Role of AI in Predictive Analytics
Artificial intelligence, particularly machine learning and deep
learning, plays a central role in predictive analytics. These tech-
nologies enable the construction of complex models capable of :
— Identifying hidden trends : Machine learning algo-
rithms analyze large datasets and detect patterns that
are not apparent to humans.
— Generating accurate forecasts : For example, recur-
rent neural networks (RNNs) are commonly used to pre-
dict time series, such as sales or market fluctuations.
— Automating and refining predictions : Through
techniques like cross-validation and hyperparameter tu-
ning, AI models can improve their performance without
human intervention.
As noted by [Min Chen and Liu, 2012], the integration of AI
enhances forecasting efficiency by reducing human biases and
increasing analysis speed.
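The cross-validation technique mentioned above can be sketched as follows ; the dataset is synthetic (generated with make_classification) and stands in for a real forecasting problem.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic classification data standing in for a real prediction problem
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 5-fold cross-validation: train on four folds, score on the held-out fold
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print("fold accuracies:", scores.round(2))
print("mean accuracy:", round(scores.mean(), 2))
```

The spread of the five fold scores gives an estimate of how stable the model's performance is, which is the basis for automated hyperparameter tuning.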
8.5.3 Examples of AI-powered Predictive Analytics
Applications
Predictive analytics is used in various sectors to address spe-
cific needs :
— Healthcare : Predictive models help anticipate epide-
mics, diagnose diseases at early stages, or predict hospital
needs [et al., 2019].
— Finance : Financial institutions use predictive analytics
to detect fraud and predict credit risks [et al., 2016].
— Retail : Businesses use these models to optimize inven-
tory, predict customer behavior, and personalize offers
[Fader and Hardie, 2016].
8.5.4 Limitations and Challenges
Despite its effectiveness, AI-powered predictive analytics pre-
sents several challenges :
— Data quality : Predictive models rely on the quality and
relevance of the data used for training. Biased data can
lead to incorrect predictions.
— Complexity of algorithms : Sophisticated algorithms
can sometimes be "black boxes," making their decisions
difficult to interpret [Rudin, 2019].
— Ethical issues : Decisions based on predictions may
raise concerns about privacy and discrimination
[Barocas and Selbst, 2016].
Predictive analytics, powered by artificial intelligence, repre-
sents a major advancement in organizations’ ability to anticipate
and address future challenges. It transforms data into a strategic
lever, allowing for resource optimization, improved user expe-
rience, and risk reduction. However, to maximize its potential,
it is essential to address its technical and ethical challenges and
ensure that predictive models are used responsibly.
8.6 Prediction Models
Predictive analytics relies on various models that, depen-
ding on their design and objective, address specific needs in di-
verse fields such as finance, healthcare, and retail. These models
are based on statistical techniques, machine learning algorithms,
and more recent deep learning approaches.
8.6.1 Linear Regression
Linear regression is one of the simplest and most widely used
models in prediction. It aims to model the relationship between
a dependent variable (Y ) and one or more independent variables
(X) by fitting a linear equation. For example :
Y = β0 + β1 X1 + β2 X2 + · · · + βn Xn + ϵ
where β represents the coefficients, and ϵ is the residual error.
This model is used for forecasts such as sales trends or cost
predictions. However, it becomes ineffective when the relation-
ships between variables are nonlinear [Montgomery et al., 2012].
8.6.2 Logistic Regression
Unlike linear regression, logistic regression is used for binary
classification problems. The model predicts the probability that
an observation belongs to a specific class (0 or 1). The sigmoid
function is at the core of this model :
P (Y = 1|X) = 1 / (1 + e^−(β0 + β1X1 + ··· + βnXn))
This model is often used to predict outcomes such as churn
probability or fraud detection [Hosmer et al., 2013].
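A minimal sketch of this model on an invented churn dataset (months as a customer and number of support tickets ; both the features and the labels are hypothetical) :

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented churn data: (months as a customer, support tickets opened) -> churned?
X = np.array([[24, 0], [36, 1], [30, 0], [3, 5], [2, 6], [4, 4]])
y = np.array([0, 0, 0, 1, 1, 1])  # 1 = churned

model = LogisticRegression()
model.fit(X, y)

# predict_proba applies the sigmoid to return P(Y = 1 | X)
proba = model.predict_proba([[3, 5]])[0, 1]  # a recent, unhappy customer
print(f"churn probability: {proba:.2f}")
```

Unlike linear regression, the output is a probability between 0 and 1, which can then be thresholded into a class decision.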
8.6.3 Random Forests
Random forests are an ensemble of supervised learning algo-
rithms based on decision trees. Each tree is built using a random
sample of the data and contributes to the final decision through
a majority vote.
This model is robust to overfitting and is used for
tasks such as customer segmentation and financial forecasting
[Breiman, 2001].
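The majority-vote mechanism can be made concrete on a toy dataset : each individual tree is queried, and the forest returns the class most trees agree on.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy two-class data: class 1 when both features are large
X = np.array([[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]])
y = np.array([0, 0, 0, 1, 1, 1])

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X, y)

# Each tree votes; the forest returns the majority class
sample = np.array([[9.0, 9.0]])
votes = [tree.predict(sample)[0] for tree in forest.estimators_]
print("majority vote:", forest.predict(sample)[0], "from", len(votes), "trees")
```

Because each tree sees a different bootstrap sample of the data, their individual errors tend to cancel out in the vote, which is where the robustness to overfitting comes from.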
8.6.4 Artificial Neural Networks
Artificial neural networks (ANNs) mimic the structure of the
human brain to model complex relationships between variables.
They consist of layers of neurons (input, hidden, and output)
that transform data through weights adjusted during training.
An example of an ANN is the multilayer perceptron (MLP).
ANNs are widely used in computer vision, speech recogni-
tion, and time series forecasting [Goodfellow et al., 2016].
8.6.5 Support Vector Machines (SVM)
SVMs are supervised models that find an optimal hyper-
plane to separate classes in a high-dimensional space. They are
particularly effective for small datasets with well-defined class
margins.
For example, SVMs are used for image classification and
spam detection [Cortes and Vapnik, 1995].
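A minimal sketch of a linear SVM on invented spam-like features (the feature names and data are hypothetical) :

```python
import numpy as np
from sklearn.svm import SVC

# Invented message features: (exclamation-mark count, link count) -> spam?
X = np.array([[0, 0], [1, 0], [0, 1], [8, 5], [9, 6], [7, 6]])
y = np.array([0, 0, 0, 1, 1, 1])  # 1 = spam

# A linear SVM finds the maximum-margin hyperplane between the two classes
clf = SVC(kernel="linear")
clf.fit(X, y)

print(clf.predict([[8, 6], [0, 1]]))  # one spam-like and one normal message
```

For classes that are not linearly separable, the same estimator accepts nonlinear kernels (e.g., RBF) that implicitly map the data into a higher-dimensional space.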
8.6.6 Time Series Models : ARIMA and LSTM
Time series-specific models, such as ARIMA (AutoRe-
gressive Integrated Moving Average) and Long Short-Term
Memory (LSTM) neural networks, are used to predict fu-
ture values based on historical data. ARIMA is a statisti-
cal model, while LSTM, based on deep learning, is better
suited for complex sequential data, such as financial flows
or weather forecasting [Hyndman and Athanasopoulos, 2008,
Hochreiter and Schmidhuber, 1997].
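Full ARIMA and LSTM implementations require dedicated libraries, but the autoregressive idea at the heart of ARIMA can be sketched with NumPy alone : fit y_t ≈ φ · y_(t−1) by least squares on a simulated series, then forecast one step ahead. The series and its coefficient are invented for illustration.

```python
import numpy as np

# Simulate a series that follows y_t = 0.8 * y_(t-1) + noise
rng = np.random.default_rng(0)
y = np.zeros(200)
for t in range(1, 200):
    y[t] = 0.8 * y[t - 1] + rng.normal(scale=0.1)

# Fit an AR(1) model by least squares: y_t ~ phi * y_(t-1)
prev, curr = y[:-1], y[1:]
phi = (prev @ curr) / (prev @ prev)

# One-step-ahead forecast from the last observed value
forecast = phi * y[-1]
print(f"estimated phi: {phi:.2f}, one-step forecast: {forecast:.3f}")
```

ARIMA generalizes this with differencing and moving-average terms, while LSTMs replace the linear recurrence with learned nonlinear gates.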
8.6.7 Applications and Limitations of Predictive Models
These models have revolutionized predictive analytics, but
they also present challenges such as :
— The need for high-quality data.
— High computational costs for complex models such as
ANNs and random forests.
— The risk of bias if training data is unrepresentative
[Barocas and Selbst, 2016].
8.7 Prediction of Pastry Sales
This section presents an example of predicting pastry sales
based on the days of the week. The simulated data is generated
in Python and used to train a predictive model.
8.7.1 Dataset
The table below shows an example of a simulated dataset :
Day        Pastry     Sales
Sunday     Eclair     162
Thursday   Croissant  111
Friday     Danish     101
Sunday     Danish     61
Wednesday  Muffin     88
These data represent the average sales for different types of
pastries based on the day.
8.7.2 Python Code to Generate the Data
The Python code used to generate a simulated dataset is presented below (Section 8.9).
In this example, we will predict the sales of different types
of pastries based on the days of the week, weather, and special
occasions (e.g., national holidays or school vacations). We will
use a dataset that integrates these variables to illustrate the
impact of external factors on sales.
8.8 Dataset
Here is a table showing the pastry sales based on the days of
the week, weather, and occasions :
Day        Pastry     Sales  Weather  Occasion
Sunday     Eclair     162    Sunny    National Holiday
Thursday   Croissant  111    Rain     School Holidays
Friday     Danish     101    Wind     Religious Holiday
Sunday     Danish     61     Sunny    National Holiday
Wednesday  Muffin     88     Rain     No Event
Monday     Donut      133    Sunny    Religious Holiday
Saturday   Eclair     190    Cloudy   School Holidays
Tuesday    Croissant  120    Rain     No Event
Thursday   Muffin     78     Wind     National Holiday
Friday     Danish     154    Sunny    School Holidays
Wednesday  Donut      102    Rain     Religious Holiday
Sunday     Muffin     75     Wind     No Event
8.9 Python Code
The following Python code generates random pastry sales
based on the days of the week, weather, and occasions, then
saves the results to a CSV file.
import pandas as pd
import random

# Define the days of the week, types of pastries, weather, and occasions
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
pastries = ['Croissant', 'Muffin', 'Donut', 'Eclair', 'Danish']
weather = ['Sun', 'Rain', 'Wind', 'Cloudy']
occasions = ['National Holiday', 'School Holidays', 'Religious Holiday', 'No Event']

# Generate random data
data = []
for _ in range(12):  # 12 rows instead of 100
    day = random.choice(days)
    pastry = random.choice(pastries)
    sales = random.randint(50, 200)
    meteo = random.choice(weather)
    occasion = random.choice(occasions)
    data.append({'Day': day, 'Pastry': pastry, 'Sales': sales, 'Weather': meteo, 'Occasion': occasion})

# Convert to DataFrame
df = pd.DataFrame(data)
print(df.head())

# Save as a CSV file
df.to_csv("pastry_sales_with_constraints.csv", index=False)
Once the data is generated and saved in a CSV file, it can be
used to train a prediction model. For example, a random forest
model can be used to predict the number of sales based on the
day and type of pastry.
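A minimal sketch of that idea, using a small hand-made sample in the same shape as the generated data (one-hot encoding turns the categorical columns into numeric features a random forest can use ; the sample values are invented) :

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Small hand-made sample in the same shape as the generated CSV
df = pd.DataFrame({
    "Day": ["Sunday", "Thursday", "Friday", "Sunday", "Wednesday", "Monday"],
    "Pastry": ["Eclair", "Croissant", "Danish", "Danish", "Muffin", "Donut"],
    "Sales": [162, 111, 101, 61, 88, 133],
})

# One-hot encode the categorical columns so the model can use them
X = pd.get_dummies(df[["Day", "Pastry"]])
y = df["Sales"]

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)

# Predict sales for one combination, aligning its columns with the training set
sample = pd.get_dummies(pd.DataFrame([{"Day": "Sunday", "Pastry": "Eclair"}]))
sample = sample.reindex(columns=X.columns, fill_value=0)
print(model.predict(sample)[0])
```

The reindex step matters : any category absent from the new sample must still appear as a zero column so the model sees the same feature layout it was trained on.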
8.10 Prediction using Linear Regression
Linear regression is a fundamental statistical model used to
predict a continuous variable from independent variables. The
goal of linear regression is to find the most appropriate linear
relationship between the target variable (Y ) and the explanatory
variables (X1 , X2 , . . . , Xn ).
8.10.1 Linear Regression Function
Linear regression models the relationship between a de-
pendent variable Y and independent variables X1 , X2 , . . . , Xn
through the following equation :
Y = β0 + β1 X1 + β2 X2 + · · · + βn Xn + ϵ
where :
— Y is the variable to predict,
— X1 , X2 , . . . , Xn are the explanatory variables (e.g., day of
the week, weather, etc.),
— β0 is the intercept,
— β1 , β2 , . . . , βn are the regression coefficients, which mea-
sure the impact of each explanatory variable on the de-
pendent variable,
— ϵ is the residual error, representing the differences bet-
ween observed and predicted values.
8.10.2 Example : Predicting Pastry Sales
In this example, we will predict pastry sales (Y ) based on
several explanatory variables : the day of the week, weather, and
special occasions (such as national holidays or school vacations).
We have the following data :
Day  Weather  Occasion  Sales
1    1        0         100
2    0        1         120
3    0        1         150
4    1        0         130
5    1        0         80
6    0        1         110
7    1        0         170
1    1        1         160
2    0        1         140
3    1        0         130
4    0        1         90
5    1        0         120
In this example, we use linear regression to predict the va-
riable Sales based on the explanatory variables : Day, Weather,
and Occasion.
8.10.3 Python Code for Linear Regression
The following code illustrates how to perform linear regres-
sion in Python using the scikit-learn library :
Listing 8.1 – Python Code for Linear Regression
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Create the dataset
data = {
    'Day': [1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5],
    'Weather': [1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1],
    'Occasion': [0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0],
    'Sales': [100, 120, 150, 130, 80, 110, 170, 160, 140, 130, 90, 120]
}
df = pd.DataFrame(data)

# Separate independent variables (X) and dependent variable (Y)
X = df[['Day', 'Weather', 'Occasion']]  # Independent variables
Y = df['Sales']  # Dependent variable

# Create and train the linear regression model
model = LinearRegression()
model.fit(X, Y)

# Predict sales for the same data (for example)
predictions = model.predict(X)

# Display coefficients
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")

# Plot predictions vs real values
plt.scatter(Y, predictions)
plt.xlabel('Actual Sales')
plt.ylabel('Predictions')
plt.title('Actual vs Predictions')
plt.show()
8.10.4 Results
By running this code, we obtain the coefficients of the li-
near regression model as well as the intercept. These values help
us understand how each variable (Day, Weather, Occasion) in-
fluences the pastry sales.
For example, the results may indicate that sales slightly
increase as the days of the week progress and that special
occasions (such as holidays) lead to an increase in sales,
while the sign of the weather coefficient shows whether sunny
weather raises or lowers them.
Predictions can be visualized against actual values in a scat-
ter plot, allowing us to evaluate the model’s quality.
8.10.5 Interpretation of Coefficients
Assume the model returns the following coefficients :
Coefficients : [5.2, −10.3, 30.5], Intercept : 50.1
This means that :
— For each additional day, sales increase by 5.2 units.
— When the weather is sunny (1), sales decrease by 10.3 units.
— When there is a special occasion (1), sales increase by 30.5 units.
— If all variables are zero (e.g., day = 0, weather = 0, occasion = 0), the model predicts sales of 50.1 units.
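As a check on these interpretations, the predicted value for one concrete case can be computed by hand from the assumed coefficients :

```python
# Assumed coefficients and intercept from the example above
intercept = 50.1
coef_day, coef_weather, coef_occasion = 5.2, -10.3, 30.5

# Predicted sales for day 3, sunny weather (1), with a special occasion (1)
sales = intercept + coef_day * 3 + coef_weather * 1 + coef_occasion * 1
print(round(sales, 1))  # 50.1 + 15.6 - 10.3 + 30.5 = 85.9
```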
The following plot illustrates the relationship between the
explanatory variables and the predicted variable (Sales) using
the obtained linear regression function.
[Figure : linear regression of sales by day of the week, comparing the actual data points with the regression prediction (x-axis : Day of the Week ; y-axis : Predicted Sales).]
8.10.6 Conclusion
Linear regression is a simple but powerful model for predic-
ting continuous values based on multiple explanatory variables.
In our example, it allows us to predict pastry sales based on the
day of the week, weather, and special occasions. However, it is
important to note that this model works well for simple linear
relationships but might not be as effective for more complex
relationships. Other models such as random forests or neural
networks could be explored for more complex data.
8.11 Conclusion : The Profile of a Data Analyst and Tips for Success
The role of a data analyst occupies a central position in the world of data analysis, offering strategic solutions to complex problems across sectors such as finance, healthcare, and marketing. This role, at the intersection of technical, analytical, and communication skills, requires a deep understanding of data and the ability to extract actionable insights.
Profile of a Data Analyst
— Technical Skills: A data analyst is proficient in data management and processing tools (SQL, Python, R), statistical techniques, and specialized libraries such as Pandas, NumPy, or ggplot. They are also familiar with visualization tools such as Tableau, Power BI, or Matplotlib for presenting clear and impactful results.
— Methodological Knowledge: Predictive analysis, modeling techniques (regression, decision trees, random forests, neural networks, etc.), and machine learning approaches are key skills. The data analyst is often responsible for turning complex models into operational tools for decision-makers.
— Analytical Thinking and Problem Solving: Critical analysis is essential for asking the right questions, evaluating potential biases in the data, and proposing robust solutions. This also includes practical work such as identifying missing data, preparing datasets, and evaluating results.
— Communication: Translating technical results into information understandable to non-specialists is crucial. A good data analyst can tell a story with data, using relevant graphs and clear explanations.
— Ethics and Responsibility: Managing bias, data privacy, and the impact of analyses on decisions are central concerns. The data analyst must ensure that their models are fair, transparent, and aligned with the organization's values.
Tips for Succeeding as a Data Analyst
1. Invest in Continuous Education: Tools and techniques evolve quickly. Stay up to date by taking advanced courses, exploring new Python libraries, or following advances in artificial intelligence.
2. Develop Concrete Projects: The best way to learn is by practicing. Work on real projects, whether sales forecasting, trend analysis, or interactive dashboards.
3. Adopt an Interdisciplinary Approach: Combine your technical skills with an understanding of the application domain (healthcare, finance, business, etc.). This will allow you to better meet the specific needs of decision-makers.
4. Cultivate Curiosity: Ask questions, explore new methodologies, and try to go beyond surface results to understand the underlying relationships in the data.
5. Prioritize Integrity: Adhere to ethical standards. Do not manipulate data to serve an unfounded objective. Ensure that your analyses are reproducible and justified.
6. Collaborate with Multidisciplinary Teams: A data analyst rarely works alone. Interact with data scientists, data engineers, and decision-makers to deliver integrated and impactful solutions.
A Look Towards the Future
In a world where data is ubiquitous, the role of the data analyst is evolving to address increasingly complex challenges. This book has explored essential concepts such as data management, predictive analysis, and advanced modeling techniques, offering concrete examples to illustrate these ideas.
The future of data analysis belongs to those who, beyond the numbers, know how to ask the right questions and find answers that transform the world. Cultivate your curiosity, develop your skills, and embrace the vast potential of data to become a key player in this ever-evolving universe.