KULLIYYAH OF INFORMATION AND COMMUNICATION
TECHNOLOGY
SEMESTER 2, 2024/2025
CSCI 4341 BIG DATA ANALYTICS
PROJECT PROPOSAL
PROJECT TITLE :
MULTIVARIATE BIG DATA ANALYSIS OF SOCIOECONOMIC AND
HEALTHCARE FACTORS INFLUENCING COVID-19 MORTALITY RATES
NO. NAME MATRIC NUMBER
1 NURAMIRATUL AISYAH BINTI RUZAIDI 2212736
2 SHARIFAH SYAZWINA BINTI SYED SYAMSULHARIS 2214326
3 ANIS NAZIRA BINTI ABD GHANI 2219732
4 WISSEBO ABDULMAJID 2218587
LECTURER’S NAME : DR. SHARYAR WANI
1. Introduction
The global outbreak of COVID-19, which began in late 2019, challenged health systems,
strained economies, and exposed previously hidden vulnerabilities in healthcare access and
socioeconomic resilience across countries. Understanding the factors that drive mortality
rates is important not only for evaluating past responses but also for preparing for
future public health crises. With the wealth of publicly available big data, especially the
comprehensive COVID-19 dataset by Our World in Data (OWID), researchers are now
equipped to explore the relationships among multiple variables at a global scale using
advanced data science techniques.
Data science distinguishes itself from conventional analytics through its emphasis on
algorithmic modelling, machine learning, and predictive inference. While data analytics
typically focuses on summarizing and contextualizing past data, data science allows for
hypothesis testing, forecasting, and the discovery of hidden patterns. According to Johns
Hopkins' Data Science Specialization and BuiltIn's classification, modern data science
encompasses multiple types of analysis beyond descriptive statistics, such as causal
analysis, inferential analysis, diagnostic analysis, predictive modeling, and mechanistic
analysis. These methods will be employed to gather insights into the impact of healthcare
infrastructure and socioeconomic indicators on COVID-19 mortality rates.
This study aims to demonstrate the multidisciplinary nature of data science by integrating
statistical learning, data mining, and predictive algorithms to analyse over 160,000 global
observations. Furthermore, the project will introduce innovation by framing contemporary
questions at the intersection of global health and socioeconomics, employing multivariate
and temporal dimensions, and highlighting disparities between countries. All sources
referenced in this project, including the dataset, academic theories, and data science
methodologies, will be cited in IEEE format.
2. Research Questions and Hypotheses
A. Two-variable Data Science Questions
1. Causal Analysis:
● Question: Does higher GDP per capita directly cause lower COVID-19
mortality per million?
● Hypothesis: Countries with higher GDP per capita tend to have significantly
lower COVID-19 death rates.
2. Inferential Analysis:
● Question: Is the difference in COVID-19 death rates statistically significant
between continents with higher hospital bed capacity vs. those with lower?
● Hypothesis: Continents with more hospital beds per thousand show
significantly reduced mortality.
3. Predictive Analysis:
● Question: Can new deaths be predicted accurately using new case numbers
alone?
● Hypothesis: A linear regression model using new case counts will achieve
high predictive power for new deaths.
B. Multivariable Data Science Questions
4. Diagnostic Analysis:
● Question: Which combination of variables (ICU capacity, age demographics,
poverty level) best explains spikes in mortality during pandemic peaks?
● Hypothesis: A combination of inadequate ICU capacity and a large elderly
population correlates with mortality surges.
5. Predictive Analysis:
● Question: Can a machine learning model using vaccination rates, testing
rates, stringency index, and health infrastructure predict daily deaths
accurately?
● Hypothesis: Ensemble models like Random Forest or XGBoost will yield
>80% accuracy in forecasting death rates.
6. Causal Analysis:
● Question: Do socioeconomic indicators (HDI, poverty rate, GDP per capita)
causally impact COVID-19 fatality rates?
● Hypothesis: Countries with lower socioeconomic scores experience higher
mortality, independent of testing or reporting bias.
7. Inferential Analysis:
● Question: Is there a statistically significant difference in death rates between
countries with high and low healthcare spending per capita?
● Hypothesis: High-spending countries exhibit significantly better survival
outcomes.
8. Time Series Forecasting:
● Question: How do vaccination trends, policy stringency, and testing rates
affect the trajectory of death rates over time?
● Hypothesis: Improvements in vaccination and stringency policies result in
downward mortality trends, with a lag of ~14 days.
9. Cluster Analysis:
● Question: Can countries be grouped into clusters based on their COVID-19
mortality profile and healthcare infrastructure?
● Hypothesis: Clustering will reveal distinct regional risk groups, particularly
separating Global North vs. Global South.
10. Mechanistic Analysis:
● Question: What mechanisms explain why high-income nations with aged
populations still suffered high mortality?
● Hypothesis: Delay in lockdowns, inconsistent mask policies, and comorbidity
prevalence explain this paradox.
11. Exploratory Correlation Mapping:
● Question: What are the strongest correlates of total COVID-19 deaths among
10+ candidate variables?
● Hypothesis: Age, GDP, ICU beds, and vaccination levels are most correlated
with total deaths.
12. Dimensionality Reduction + Predictive Modeling:
● Question: After applying PCA to reduce the dataset's dimensionality, can we
still achieve high predictive accuracy for death outcomes?
● Hypothesis: Principal Components derived from healthcare and
socioeconomic dimensions will retain >90% of predictive signal.
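As an illustration of how the lag stated in Hypothesis 8 could be examined, the following is a minimal sketch that scans candidate lags and reports the one at which a driver series correlates most strongly with later deaths. It runs on synthetic data only: the series names, the built-in 14-day delay, and the helper `best_lag` are assumptions made for this example and are not results from the OWID dataset.

```python
import numpy as np
import pandas as pd


def best_lag(driver: pd.Series, outcome: pd.Series, max_lag: int = 30) -> tuple:
    """Return (lag, correlation) where |correlation| between the lagged driver
    and the outcome is largest, scanning lags from 0 to max_lag days."""
    best = (0, 0.0)
    for lag in range(max_lag + 1):
        # shift(lag) pairs the driver value from `lag` days earlier with today's outcome;
        # Series.corr ignores the NaN pairs created by the shift.
        r = outcome.corr(driver.shift(lag))
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best


# Synthetic illustration (assumption): deaths respond negatively to the
# vaccination signal observed 14 days earlier, plus a little noise.
rng = np.random.default_rng(0)
days = pd.date_range("2021-01-01", periods=200, freq="D")
vaccinations = pd.Series(rng.normal(size=200), index=days).rolling(7, min_periods=1).mean()
noise = pd.Series(rng.normal(scale=0.1, size=200), index=days)
deaths = -2.0 * vaccinations.shift(14) + noise

lag, r = best_lag(vaccinations, deaths)
print(f"strongest association at lag={lag} days (r={r:.2f})")
```

A lagged correlation of this kind only indicates timing of association; it would not by itself establish that vaccination or stringency changes cause the change in mortality.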
3. Research Objective
This study aims to analyse how socioeconomic and healthcare factors influence
COVID-19 mortality rates using big data techniques. The objectives are:
● To explore patterns and key factors related to COVID-19 deaths across different
countries.
● To identify combinations of variables like ICU capacity, age and poverty level linked to
high death rates.
● To compare COVID-19 death rates between countries with different income levels
and healthcare spending.
● To examine whether factors like GDP per capita or Human Development Index (HDI)
have a direct effect on mortality.
● To build predictive models that estimate death rates based on case numbers,
vaccination rates, and other indicators.
● To provide recommendations based on the findings to help improve public health
responses in future pandemics.
4. Research Significance
This study uses data science methods to better understand the impact of
socioeconomic and healthcare factors on COVID-19 mortality. By analysing big data from
multiple countries, the research provides useful insights for governments and health
organisations to make informed, data-driven decisions.
Through diagnostic and predictive analysis, this study helps identify high-risk
conditions and patterns that contribute to higher death rates. Causal and inferential
techniques offer evidence on how income, healthcare access, and vaccination affect survival
outcomes.
The findings can support better planning for future health crises by helping countries
improve healthcare systems, allocate resources more effectively, and strengthen public
health policies. Overall, this research promotes global health preparedness and smarter
decision-making using data.
5. Literature Review
Numerous studies have explored the influence of socioeconomic and healthcare
factors on COVID-19 mortality using a variety of data science approaches. These studies
reveal how variables such as age, race, income level, healthcare capacity, and comorbidities
have significantly impacted COVID-19 outcomes. Techniques ranging from traditional
regression and statistical modeling to advanced machine learning have been employed to
uncover these relationships, offering both explanatory and predictive insights.
The table below summarizes the findings from five articles that are closely aligned
with the objectives of this proposal.
No | Year | Authors | Research Problem / Application | Techniques Used | Results | Future Works (if any)
1 | 2020 | Li et al. | Multivariate analysis of COVID-19 case and death rates in U.S. counties | Multivariate regression analysis | Higher death rates associated with higher percentage of Black population and lower temperatures | Extend analysis to other countries and include more time-dependent variables
2 | 2022 | El Jai et al. | Socio-economic modeling of short-term COVID-19 trends | Statistical modeling and data analytics | Identified strong links between socio-economic indicators and short-term case/death fluctuations | Suggested integration of mobility and policy data in future studies
3 | 2020 | Albitar et al. | Identifying clinical and demographic risk factors for COVID-19 mortality | Meta-analysis & systematic review | Age, comorbidities (especially diabetes and hypertension), and male gender significantly increased risk | Recommend targeted interventions for high-risk patients
4 | 2020 | Cao et al. | Influence of demographic and socioeconomic factors on case-fatality rate (CFR) globally | Spatial regression analysis | Older population and income inequality linked to higher case-fatality rates | Suggested integrating health system variables for improved models
5 | 2022 | Jamshidi et al. | Predicting COVID-19 mortality using machine learning | Machine Learning Algorithms | Developed a model with high accuracy for predicting mortality in ICU patients | Suggested integration with broader datasets for validation
These studies demonstrate a growing interest in combining demographic, clinical,
and socioeconomic data to understand and predict COVID-19 mortality. Our current project
builds on this body of work by applying multivariate and big data analytics to a global
dataset, incorporating a wide range of variables and techniques, including causal inference,
predictive modeling, clustering, and dimensionality reduction, to develop a more holistic
understanding of the factors that influence COVID-19 death rates.
6. Methodology
A. Datasets and Tools Description
The dataset is sourced from Kaggle and was originally developed by Our World in Data
(OWID) in collaboration with the University of Oxford, making it a reliable and widely
accessible resource on the COVID-19 pandemic. OWID has developed a comprehensive
repository of global datasets focused on major issues affecting humanity. In response to the
COVID-19 pandemic, extensive data has been collected daily from countries and territories
worldwide. This dataset serves as a critical resource that researchers, policymakers, and
the general public can use to make informed decisions.
The dataset provides detailed information on COVID-19 cases, testing,
hospitalizations, vaccinations, and related metrics. It incorporates demographic,
epidemiological, healthcare, and policy-related variables to enable deep analysis of the
pandemic's global impact, underlying risk factors, and the effectiveness of healthcare
responses. It contains 67 attributes and 166,326 observations, covering daily and cumulative
counts of COVID-19 cases and deaths, testing and vaccination rates, hospital and ICU
occupancy, and various governmental policy measures. It supports time-series forecasting,
policy impact studies, and comparative country-level health system performance
evaluations.
For the analysis of the COVID-19 dataset, VS Code and the Python language will be
used as the primary tools. Python will handle data preparation, cleaning, and exploratory
data analysis (EDA). Libraries such as Pandas, NumPy, and Matplotlib will support data
manipulation, computation, and visualization.
B. Research Process
1. Data Collection and Preprocessing
In the data collection and preprocessing phase, we will clean the data by handling
missing values, removing duplicate rows, and normalizing both numerical and
categorical columns. Overall, 44.54% of the values in the dataset are currently
missing, which makes cleaning a crucial step before any further analysis.
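The cleaning steps described above can be sketched as follows. This is a minimal illustration on a small toy table, not the project's final pipeline: the column names, the 60% sparsity threshold, the median imputation, and the min-max scaling are all assumptions chosen for the example, and the helper `clean` is hypothetical.

```python
import numpy as np
import pandas as pd


def clean(df: pd.DataFrame, max_missing: float = 0.6) -> pd.DataFrame:
    """Drop duplicate rows and very sparse columns, impute and scale numeric
    columns, and standardize text labels (assumes non-numeric columns are text)."""
    out = df.drop_duplicates()
    # Keep only columns whose share of missing values is at most max_missing.
    keep = out.columns[out.isna().mean() <= max_missing]
    out = out.loc[:, keep].copy()

    num_cols = out.select_dtypes(include="number").columns
    cat_cols = out.select_dtypes(exclude="number").columns

    # Numeric columns: median imputation, then min-max scaling to [0, 1].
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())
    span = (out[num_cols].max() - out[num_cols].min()).replace(0, 1)
    out[num_cols] = (out[num_cols] - out[num_cols].min()) / span

    # Text columns: trim whitespace and use consistent capitalization.
    for col in cat_cols:
        out[col] = out[col].str.strip().str.title()
    return out


# Toy data (assumption): column names chosen for illustration only.
raw = pd.DataFrame({
    "location": [" malaysia", "Malaysia", "Japan", "Japan", "Chile"],
    "new_cases": [100.0, 120.0, np.nan, np.nan, 80.0],
    "new_deaths": [1.0, 2.0, 0.0, 0.0, np.nan],
    "icu_patients": [np.nan, np.nan, np.nan, np.nan, 5.0],
})

# Overall share of missing cells, the same kind of figure as the 44.54% quoted above.
missing_pct = raw.isna().to_numpy().mean() * 100
cleaned = clean(raw)
print(f"missing before cleaning: {missing_pct:.2f}%")
print(cleaned)
```

Median imputation is only one option; for time-series columns, forward filling within each country may be more appropriate, and that choice would need to be justified per variable.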
2. Exploratory Data Analysis
Exploratory Data Analysis (EDA) is a key step to better understand the distribution
and relationships within the dataset before applying machine learning models or
other analyses. Through EDA, we will identify missing values, detect outliers, and
visualize the data using methods such as correlation analysis, box plots, and scatter
plots. This process will help uncover valuable insights and reveal trends that exist in
the data.
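A minimal sketch of two of the EDA checks mentioned above, correlation analysis and outlier detection, is given below. It uses synthetic data; the column names, the choice of Spearman correlation, and the 1.5 x IQR rule (the convention used for box-plot whiskers) are assumptions for illustration, and `iqr_outliers` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values lying more than k * IQR outside the first or third quartile."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


# Synthetic data (assumption): the relationships are built in for the example.
rng = np.random.default_rng(1)
n = 300
gdp = rng.normal(20000, 5000, n)
beds = 0.0002 * gdp + rng.normal(0, 0.5, n)
deaths = 3000 - 0.05 * gdp + rng.normal(0, 200, n)
deaths[0] = 20000.0  # one injected extreme value to exercise the outlier check

df = pd.DataFrame({
    "gdp_per_capita": gdp,
    "hospital_beds_per_thousand": beds,
    "total_deaths_per_million": deaths,
})

# Rank-based correlation is less sensitive to extreme values than Pearson.
corr = df.corr(method="spearman")
flags = iqr_outliers(df["total_deaths_per_million"])

print(corr.round(2))
print("flagged outliers:", int(flags.sum()))
```

The same flags and correlation matrix could then be passed to Matplotlib for the box plots and scatter plots described in this step.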
3. Machine Learning Algorithms Used
Several machine learning algorithms will be used. Regression analysis
techniques will be employed to predict mortality rates based on multiple influencing
factors. Classification algorithms will be used to categorize countries or regions
based on high or low mortality risk. Clustering methods will be applied to identify
patterns among countries with similar socioeconomic and healthcare profiles.
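The regression step can be illustrated with a minimal sketch of multivariate linear regression evaluated on a held-out split. It is written with NumPy only and uses synthetic features; the helper names, the coefficients, and the 400/100 split are assumptions for the example, and a library implementation may well be used in the actual project.

```python
import numpy as np


def fit_linear(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Ordinary least squares; returns [intercept, coef_1, ..., coef_p]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef: np.ndarray, X: np.ndarray) -> np.ndarray:
    return coef[0] + X @ coef[1:]


def r2(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination on the supplied data."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)


# Synthetic data (assumption): three standardized predictors standing in for
# indicators such as vaccination rate, stringency index, and hospital beds.
rng = np.random.default_rng(42)
n = 500
X = rng.normal(size=(n, 3))
true_coef = np.array([-1.5, 2.0, 0.5])
y = 10.0 + X @ true_coef + rng.normal(scale=0.5, size=n)

# Hold out 100 rows so the score reflects unseen data rather than training fit.
idx = rng.permutation(n)
train, test = idx[:400], idx[400:]
coef = fit_linear(X[train], y[train])
score = r2(y[test], predict(coef, X[test]))
print("coefficients:", coef.round(2), "held-out R^2:", round(score, 3))
```

For the real dataset, a random split across rows would mix dates from the same country between training and test sets, so a split by time or by country is likely to give a more honest estimate.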
References
[1] BuiltIn, “Types of Data Analysis: 8 Types & How to Use Them.” [Online]. Available:
[Link]
[2] Our World in Data, “COVID-19 dataset,” 2024. [Online]. Available:
[Link]
[3] “Data Science vs Data Analytics. Medium Article on the Distinctions.” [Online]. Available:
[Link]
[4] C. Bambra, R. Riordan, J. Ford, and F. Matthews, “The COVID-19 pandemic and health
inequalities,” J. Epidemiol. Community Health, vol. 74, no. 11, pp. 964–968, 2020, doi:
10.1136/jech-2020-214401.
[5] A. Y. Li et al., “Multivariate analysis of factors affecting COVID-19 case and death rate in
U.S. counties: The significant effects of Black race and temperature,” medRxiv, 2020, doi:
10.1101/2020.04.17.20069708.
[6] M. El Jai, M. Zhar, D. Ouazar et al., “Socio-economic analysis of short-term trends of
COVID-19: Modeling and data analytics,” BMC Public Health, vol. 22, p. 1633, 2022, doi:
10.1186/s12889-022-13788-4.
[7] O. Albitar, R. Ballouze, J. P. Ooi, and S. M. Sheikh Ghadzi, “Risk factors for mortality
among COVID-19 patients,” Diabetes Res. Clin. Pract., vol. 166, p. 108293, 2020, doi:
10.1016/[Link].2020.108293.
[8] Y. Cao, A. Hiyoshi, and S. Montgomery, “COVID-19 case-fatality rate and demographic
and socioeconomic influencers: Worldwide spatial regression analysis based on
country-level data,” BMJ Open, vol. 10, no. 10, p. e043560, 2020, doi:
10.1136/bmjopen-2020-043560.
[9] E. Jamshidi, A. Asgary, N. Tavakoli, and A. Zali, “Using machine learning to predict
mortality for COVID-19 patients on day 0 in the ICU,” Front. Digit. Health, vol. 3, p. 681608,
2022, doi: 10.3389/fdgth.2021.681608.