Big Data Analysis of COVID-19 Mortality

The project proposal focuses on analyzing socioeconomic and healthcare factors influencing COVID-19 mortality rates using big data analytics. It aims to explore patterns, identify key variables, and build predictive models to improve public health responses. The study employs various data science techniques and utilizes a comprehensive COVID-19 dataset from Our World in Data to derive insights for better health crisis preparedness.


KULLIYYAH OF INFORMATION AND COMMUNICATION TECHNOLOGY

SEMESTER 2, 2024/2025

CSCI 4341 BIG DATA ANALYTICS

PROJECT PROPOSAL

PROJECT TITLE :
MULTIVARIATE BIG DATA ANALYSIS OF SOCIOECONOMIC AND
HEALTHCARE FACTORS INFLUENCING COVID-19 MORTALITY RATES

NO. NAME MATRIC NUMBER

1 NURAMIRATUL AISYAH BINTI RUZAIDI 2212736

2 SHARIFAH SYAZWINA BINTI SYED SYAMSULHARIS 2214326

3 ANIS NAZIRA BINTI ABD GHANI 2219732

4 WISSEBO ABDULMAJID 2218587

LECTURER’S NAME : DR. SHARYAR WANI


1. Introduction

The global outbreak of COVID-19 in 2019 challenged health systems, strained
economies, and exposed previously hidden vulnerabilities in healthcare access and
socioeconomic resilience across countries. Understanding the factors that drive mortality
rates is important not only for evaluating past responses but also for preparing for
future public health crises. With the wealth of publicly available big data, especially the
comprehensive COVID-19 dataset by Our World in Data (OWID), researchers are now
equipped to explore the correlations among multiple variables at a global scale using
advanced data science techniques.

Data science distinguishes itself from conventional analytics through its emphasis on
algorithmic modelling, machine learning, and predictive inference. While data analytics
typically focuses on summarizing and contextualizing past data, data science allows for
hypothesis testing, forecasting, and the discovery of hidden patterns. According to Johns
Hopkins' Data Science Specialization and BuiltIn's classification, modern data science
encompasses multiple types of analysis beyond descriptive statistics, such as causal
analysis, inferential analysis, diagnostic analysis, predictive modeling, and mechanistic
analysis. These methods will be employed to gather insights into the impact of healthcare
infrastructure and socioeconomic indicators on COVID-19 mortality rates.

This study aims to demonstrate the multidisciplinary nature of data science by integrating
statistical learning, data mining, and predictive algorithms to analyse over 160,000 global
observations. Furthermore, the project will introduce innovation by framing contemporary
questions at the intersection of global health and socioeconomics, employing multivariate
and temporal dimensions, and highlighting disparities between countries. All sources
referenced in this project, including the dataset, academic theories, and data science
methodologies, will be cited in IEEE format.

2. Research Questions and Hypotheses

A. Two-variable Data Science Questions

1.​ Causal Analysis:

●​ Question: Does higher GDP per capita directly cause lower COVID-19
mortality per million?
●​ Hypothesis: Countries with higher GDP per capita tend to have significantly
lower COVID-19 death rates.

2.​ Inferential Analysis:

● Question: Is the difference in COVID-19 death rates statistically significant
between continents with higher hospital bed capacity vs. those with lower?
● Hypothesis: Continents with more hospital beds per thousand show
significantly reduced mortality.

3. Predictive Analysis:

●​ Question: Can new deaths be predicted accurately using new case numbers
alone?
●​ Hypothesis: A linear regression model using new case counts will produce a
high predictive power for new deaths.
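
The predictive question above (new deaths from new cases alone) can be sketched as a one-variable least-squares fit. This is a minimal illustration, not the project's final model: the data are synthetic, the 2% death-to-case ratio and noise level are assumptions chosen only for the demo, and the helper name `fit_simple_linear` is hypothetical. R² is used here as one possible measure of "predictive power".

```python
import numpy as np

def fit_simple_linear(x, y):
    """Fit y ~ slope * x + intercept by ordinary least squares.

    Returns (slope, intercept, r2), where r2 is the coefficient of determination.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    return float(slope), float(intercept), 1.0 - ss_res / ss_tot

# Synthetic illustration only (not OWID data): deaths are assumed to be
# roughly 2% of cases plus Gaussian noise.
rng = np.random.default_rng(0)
new_cases = rng.integers(100, 10_000, size=200).astype(float)
new_deaths = 0.02 * new_cases + rng.normal(0.0, 5.0, size=200)

slope, intercept, r2 = fit_simple_linear(new_cases, new_deaths)
print(f"slope={slope:.4f}, intercept={intercept:.2f}, R^2={r2:.3f}")
```

On real daily data the fit would likely be weaker than on this toy example, since reporting delays and the lag between infection and death are not modelled here.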

B. Multivariable Data Science Questions

4.​ Diagnostic Analysis:

● Question: Which combination of variables (ICU capacity, age demographics,
poverty level) best explains spikes in mortality during pandemic peaks?
● Hypothesis: A mix of inadequate ICU capacity and high aging population
correlates with mortality surges.

5.​ Predictive Analysis:

● Question: Can a machine learning model using vaccination rates, testing
rates, stringency index, and health infrastructure predict daily deaths
accurately?
● Hypothesis: Ensemble models like Random Forest or XGBoost will yield
>80% accuracy in forecasting death rates.

6.​ Causal Analysis:

● Question: Do socioeconomic indicators (HDI, poverty rate, GDP per capita)
causally impact COVID-19 fatality rates?
● Hypothesis: Countries with lower socioeconomic scores experience higher
mortality, independent of testing or reporting bias.

7.​ Inferential Analysis:

● Question: Is there a statistically significant difference in death rates between
countries with high and low healthcare spending per capita?
● Hypothesis: High-spending countries exhibit significantly better survival
outcomes.

8.​ Time Series Forecasting:

● Question: How do vaccination trends, policy stringency, and testing rates
affect the trajectory of death rates over time?
● Hypothesis: Improvements in vaccination and stringency policies result in
downward mortality trends, with a lag of ~14 days.

9.​ Cluster Analysis:

● Question: Can countries be grouped into clusters based on their COVID-19
mortality profile and healthcare infrastructure?
● Hypothesis: Clustering will reveal distinct regional risk groups, particularly
separating Global North vs. Global South.

10. Mechanistic Analysis:

● Question: What mechanisms explain why high-income nations with aged
populations still suffered high mortality?
● Hypothesis: Delay in lockdowns, inconsistent mask policies, and comorbidity
prevalence explain this paradox.

11. Exploratory Correlation Mapping:

●​ Question: What are the strongest correlates of total COVID-19 deaths among
10+ candidate variables?
●​ Hypothesis: Age, GDP, ICU beds, and vaccination levels are most correlated
with total deaths.

12. Dimensionality Reduction + Predictive Modeling:

● Question: After applying PCA to reduce the dataset's dimensionality, can we
still achieve high predictive accuracy for death outcomes?
● Hypothesis: Principal Components derived from healthcare and
socioeconomic dimensions will retain >90% of predictive signal.
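
As a rough illustration of the PCA question above, the sketch below computes how many principal components are needed to reach a 90% cumulative explained-variance threshold. Several things here are assumptions: explained variance is used only as a proxy for "predictive signal" (the two are not the same thing), the data are synthetic with two hidden factors, and the function names are hypothetical.

```python
import numpy as np

def pca_explained_variance(X):
    """Return the explained-variance ratio of each principal component."""
    X = np.asarray(X, dtype=float)
    # Standardize each column so no variable dominates by scale
    # (assumes no constant columns).
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    singular_values = np.linalg.svd(Z, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

def components_needed(ratio, threshold=0.90):
    """Smallest number of components whose cumulative ratio reaches threshold."""
    cumulative = np.cumsum(ratio)
    return min(int(np.searchsorted(cumulative, threshold)) + 1, len(ratio))

# Synthetic illustration only: six observed variables driven by two latent
# factors (think "healthcare" and "socioeconomic") plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
loadings = np.array([[1.0, 0.9, 0.8, 0.0, 0.1, 0.5],
                     [0.0, 0.1, 0.3, 1.0, 0.9, 0.5]])
X = latent @ loadings + 0.1 * rng.normal(size=(300, 6))

ratio = pca_explained_variance(X)
n_components = components_needed(ratio)
print(n_components, np.round(ratio, 3))
```

To test the hypothesis itself, a model trained on the retained components would still need to be compared against one trained on the full feature set.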

3. Research Objective

This study aims to analyse how socioeconomic and healthcare factors influence
COVID-19 mortality rates using big data techniques. The objectives are:

●​ To explore patterns and key factors related to COVID-19 deaths across different
countries.

●​ To identify combinations of variables like ICU capacity, age and poverty level linked to
high death rates.

●​ To compare COVID-19 death rates between countries with different income levels
and healthcare spending.

●​ To examine whether factors like GDP per capita or Human Development Index (HDI)
have a direct effect on mortality.

●​ To build predictive models that estimate death rates based on case numbers,
vaccination rates, and other indicators.

● To provide recommendations based on the findings to help improve public health
responses in future pandemics.

4. Research Significance

This study uses data science methods to better understand the impact of
socioeconomic and healthcare factors on COVID-19 mortality. By analysing big data from
multiple countries, the research aims to provide useful insights for governments and health
organisations to make informed, data-driven decisions.

Through diagnostic and predictive analysis, this study helps identify high-risk
conditions and patterns that contribute to higher death rates. Causal and inferential
techniques offer evidence on how income, healthcare access, and vaccination affect survival
outcomes.

The findings can support better planning for future health crises by helping countries
improve healthcare systems, allocate resources more effectively, and strengthen public
health policies. Overall, this research promotes global health preparedness and smarter
decision-making using data.

5. Literature Review

Numerous studies have explored the influence of socioeconomic and healthcare
factors on COVID-19 mortality using a variety of data science approaches. These studies
reveal how variables such as age, race, income level, healthcare capacity, and comorbidities
have significantly impacted COVID-19 outcomes. Techniques ranging from traditional
regression and statistical modeling to advanced machine learning have been employed to
uncover these relationships, offering both explanatory and predictive insights.

​ The table below summarizes the findings from five articles that are closely aligned
with the objectives of this proposal.

No | Year | Authors | Research Problem / Application | Techniques Used | Results | Future Works (if any)
1 | 2020 | Li et al. | Multivariate analysis of COVID-19 case and death rates in U.S. counties | Multivariate regression analysis | Higher death rates associated with higher percentage of Black population and lower temperatures | Extend analysis to other countries and include more time-dependent variables
2 | 2022 | El Jai et al. | Socio-economic modeling of short-term COVID-19 trends | Statistical modeling and data analytics | Identified strong links between socio-economic indicators and short-term case/death fluctuations | Suggested integration of mobility and policy data in future studies
3 | 2020 | Albitar et al. | Identifying clinical and demographic risk factors for COVID-19 mortality | Meta-analysis & systematic review | Age, comorbidities especially diabetes and hypertension, and male gender significantly increased risk | Recommend targeted interventions for high-risk patients
4 | 2020 | Cao et al. | Influence of demographic and socioeconomic factors on case-fatality rate (CFR) globally | Spatial regression analysis | Older population and income inequality linked to higher case-fatality rates | Suggested integrating health system variables for improved models
5 | 2022 | Jamshidi et al. | Predicting COVID-19 mortality using machine learning | Machine Learning Algorithms | Developed a model with high accuracy for predicting mortality in ICU patients | Suggested integration with broader datasets for validation

These studies demonstrate a growing interest in combining demographic, clinical,
and socioeconomic data to understand and predict COVID-19 mortality. Our current project
builds on this body of work by applying multivariate and big data analytics to a global
dataset, incorporating a wide range of variables and techniques, including causal inference,
predictive modeling, clustering, and dimensionality reduction, to develop a more holistic
understanding of the factors that influence COVID-19 death rates.

6. Methodology

A.​ Datasets and Tools Description

This dataset is sourced from Kaggle and was originally developed by Our World in Data
(OWID) in collaboration with the University of Oxford, making it a reliable and widely
accessible resource on the COVID-19 pandemic. OWID has developed a comprehensive
repository of global datasets focused on major issues affecting humanity. In response to the
COVID-19 pandemic, extensive data has been collected daily from countries and territories
worldwide. This dataset serves as a critical resource for researchers, policymakers, and
the general public to make informed decisions.

The dataset provides detailed information on COVID-19 cases, testing,
hospitalizations, vaccinations, and related metrics. It incorporates demographic,
epidemiological, healthcare, and policy-related variables to enable deep analysis of the
pandemic's global impact, underlying risk factors, and the effectiveness of healthcare
responses. It contains 67 attributes and 166,326 observations, covering daily and cumulative
counts of COVID-19 cases and deaths, testing and vaccination rates, hospital and ICU
occupancy, and various governmental policy measures. It supports time-series forecasting,
policy impact studies, and comparative country-level health system performance
evaluations.

For the analysis of the COVID-19 dataset, Python in VS Code will be used as the
primary tool. Python will handle data preparation, cleaning, and exploratory data
analysis (EDA). Libraries such as Pandas, NumPy, and Matplotlib will support data
manipulation, computation, and visualization.

B.​ Research Process

1.​ Data Collection and Preprocessing​


In the data collection and preprocessing phase, we will perform data cleaning steps
to handle missing values, remove duplicate rows, and normalize both numerical and
categorical columns. A substantial share of the dataset is currently missing, with an
overall missing value percentage of 44.54%, which makes cleaning a crucial step
before any further analysis.
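
A minimal Pandas sketch of these cleaning steps is shown below, using a tiny toy frame rather than the real dataset. The overall missing percentage is computed as missing cells divided by total cells, which is one plausible way a figure like the one above could be derived. Median imputation and min-max scaling are assumptions made for illustration, since the proposal does not fix a specific strategy, and the function names and toy column names are hypothetical.

```python
import numpy as np
import pandas as pd

def overall_missing_pct(df):
    """Percentage of all cells in the frame that are missing."""
    return float(df.isna().sum().sum() / df.size * 100)

def basic_clean(df, numeric_cols):
    """Drop duplicate rows, impute numeric gaps, and min-max scale to [0, 1]."""
    out = df.drop_duplicates().copy()
    for col in numeric_cols:
        # Assumption: median imputation; other strategies may suit better.
        out[col] = out[col].fillna(out[col].median())
        lo, hi = out[col].min(), out[col].max()
        if hi > lo:  # skip constant columns to avoid division by zero
            out[col] = (out[col] - lo) / (hi - lo)
    return out

# Toy data for illustration only.
toy = pd.DataFrame({
    "location": ["A", "A", "B", "B", "B"],
    "new_cases": [10.0, 10.0, np.nan, 30.0, 50.0],
    "new_deaths": [1.0, 1.0, 2.0, np.nan, 4.0],
})

pct = overall_missing_pct(toy)  # 2 missing cells out of 15
clean = basic_clean(toy, ["new_cases", "new_deaths"])
print(f"{pct:.2f}% missing before cleaning")
print(clean)
```

With so many values missing in the real data, dropping very sparse columns before imputing may be worth considering as well.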

2.​ Exploratory Data Analysis​


Exploratory Data Analysis (EDA) is a key step to better understand the distribution
and relationships within the dataset before applying machine learning models or
other analyses. Through EDA, we will identify missing values, detect outliers, and
visualize the data using methods such as correlation analysis, box plots, and scatter
plots. This process will help uncover valuable insights and reveal trends that exist in
the data.
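
Two of the EDA steps mentioned above, correlation analysis and outlier detection, might look like the following sketch. The 1.5 × IQR rule is the same convention box plots use to draw their whiskers. The data are synthetic and the column and function names are hypothetical placeholders, not fields taken from the dataset.

```python
import numpy as np
import pandas as pd

def correlation_with_target(df, target):
    """Pearson correlation of every numeric column with the target,
    ordered from strongest to weakest absolute correlation."""
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

def iqr_outliers(series, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Synthetic illustration only: one strong and one weak driver of the target.
rng = np.random.default_rng(0)
strong = rng.normal(size=200)
weak = rng.normal(size=200)
demo = pd.DataFrame({
    "strong_factor": strong,
    "weak_factor": weak,
    "mortality": 3.0 * strong + 0.5 * weak + rng.normal(0.0, 0.5, size=200),
})

ranked = correlation_with_target(demo, "mortality")
mask = iqr_outliers(pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0]))
print(ranked)
print(mask.tolist())
```

The ranked correlations can then be plotted as a heatmap or bar chart with Matplotlib, and flagged outliers inspected before deciding whether to keep them.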

3.​ Machine Learning Algorithms Used​


Several machine learning algorithms will be used. Regression analysis
techniques will be employed to predict mortality rates based on multiple influencing
factors. Classification algorithms will be used to categorize countries or regions
based on high or low mortality risk. Clustering methods will be applied to identify
patterns among countries with similar socioeconomic and healthcare profiles.
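
To make the clustering step concrete, here is a small NumPy sketch of k-means (Lloyd's algorithm). The proposal does not name a specific clustering method, so k-means is an assumption, as are the choice of k = 2 and the synthetic two-feature data, which stand in for already standardized country-level indicators.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means; returns (labels, centers)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise centers from k distinct data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center; keep the old one if a cluster is empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment against the final centers.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1), centers

# Synthetic illustration only: two well-separated groups of "countries"
# described by two standardized features.
rng = np.random.default_rng(1)
low = rng.normal(loc=[-2.0, -2.0], scale=0.3, size=(30, 2))
high = rng.normal(loc=[2.0, 2.0], scale=0.3, size=(30, 2))
features = np.vstack([low, high])

labels, centers = kmeans(features, k=2)
print(np.bincount(labels), np.round(centers, 2))
```

On real data the number of clusters would need to be chosen, for example with an elbow plot or silhouette scores, and features should be scaled first so that no single indicator dominates the distance.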
References

[1] BuiltIn, “Types of Data Analysis: 8 Types & How to Use Them.” [Online]. Available:
[Link]

[2] Our World in Data, “COVID-19 dataset,” 2024. [Online]. Available:
[Link]

[3] “Data Science vs Data Analytics. Medium Article on the Distinctions.” [Online]. Available:
[Link]

[4] C. Bambra, R. Riordan, J. Ford, and F. Matthews, “The COVID-19 pandemic and health
inequalities,” J. Epidemiol. Community Health, vol. 74, no. 11, pp. 964–968, 2020, doi:
10.1136/jech-2020-214401.

[5] A. Y. Li et al., “Multivariate analysis of factors affecting COVID-19 case and death rate in
U.S. counties: The significant effects of Black race and temperature,” medRxiv, 2020, doi:
10.1101/2020.04.17.20069708.

[6] M. El Jai, M. Zhar, D. Ouazar et al., “Socio-economic analysis of short-term trends of
COVID-19: Modeling and data analytics,” BMC Public Health, vol. 22, p. 1633, 2022, doi:
10.1186/s12889-022-13788-4.

[7] O. Albitar, R. Ballouze, J. P. Ooi, and S. M. Sheikh Ghadzi, “Risk factors for mortality
among COVID-19 patients,” Diabetes Res. Clin. Pract., vol. 166, p. 108293, 2020, doi:
10.1016/[Link].2020.108293.

[8] Y. Cao, A. Hiyoshi, and S. Montgomery, “COVID-19 case-fatality rate and demographic
and socioeconomic influencers: Worldwide spatial regression analysis based on
country-level data,” BMJ Open, vol. 10, no. 10, p. e043560, 2020, doi:
10.1136/bmjopen-2020-043560.

[9] E. Jamshidi, A. Asgary, N. Tavakoli, and A. Zali, “Using machine learning to predict
mortality for COVID-19 patients on day 0 in the ICU,” Front. Digit. Health, vol. 3, p. 681608,
2022, doi: 10.3389/fdgth.2021.681608.
