Minor Project Report
The COVID-19 pandemic, caused by the novel coronavirus SARS-CoV-2, has rapidly
evolved into one of the most significant global health crises of the 21st century,
profoundly impacting individuals, communities, and nations across the globe. First
identified in December 2019 in Wuhan, China, the virus quickly spread across borders,
leading to widespread transmission and significant morbidity and mortality. The World
Health Organization (WHO) declared COVID-19 a Public Health Emergency of
International Concern in January 2020, followed by a declaration of a pandemic in March
2020, as the virus continued its relentless spread to virtually every corner of the world.
Since then, efforts to contain the virus, mitigate its impact, and develop effective
interventions have been at the forefront of global health agendas, with researchers,
healthcare professionals, policymakers, and communities working tirelessly to confront
this unprecedented challenge.
clinical characteristics, public health implications, and socio-economic impacts. By
synthesizing findings from diverse research disciplines, we seek to contribute to a deeper
understanding of the pandemic and inform evidence-based strategies for addressing its
immediate and long-term consequences. We can harness the collective expertise and
resources needed to overcome this unprecedented global health crisis and build a more
resilient future for all through interdisciplinary collaboration and global solidarity.
Abstract
The project presents an in-depth analysis of the COVID-19 pandemic, caused by the
SARS-CoV-2 virus, highlighting its extensive impacts on public health, economies, and
societies worldwide. Epidemiological studies are crucial for understanding the
distribution and determinants of the disease, utilizing data from national health databases
and public health reports to analyze trends, identify hotspots, and explore factors
influencing virus spread through statistical tools and regression analysis. These studies
are vital for identifying high-risk areas and populations for targeted interventions.
Surveys on public health behavior are another key method, gathering data on individual
behaviors, attitudes, and perceptions through online surveys of 1,000 participants.
Descriptive and inferential statistics help summarize the data and explore associations
between demographics and health behaviors, providing insights into public adherence to
guidelines and informing health campaigns and policies.
COVID-19's impact. The synthesis of findings across virological, epidemiological,
clinical, and socioeconomic research underscores the multifaceted nature of the
pandemic. Virological studies have detailed the virus's genomic structure and
transmission dynamics, aiding in diagnosis and vaccine development. Clinical research
has clarified COVID-19's clinical spectrum, while public health studies have assessed
intervention effectiveness. Socio-economic research has highlighted the pandemic's
broader impacts, emphasizing the need for sustained research, interdisciplinary
collaboration, and global solidarity to address ongoing challenges and prepare for future
pandemics.
Methodology
1. Study Design
This study employs a retrospective cohort study design to investigate the impact of
COVID-19 on health outcomes and healthcare utilization among individuals with
underlying health conditions. The retrospective design allows for the examination of
historical data from electronic health records (EHRs) to assess outcomes over time.
2. Study Setting
The study is conducted at a tertiary care medical center serving a diverse patient
population in Mexico. The institution has comprehensive EHR systems capturing detailed
patient information, including demographic data, clinical diagnoses, laboratory results,
and treatment records.
3. Study Population
The study population comprises adult patients (aged 18 years and above) with
documented underlying health conditions, including but not limited to hypertension,
diabetes, cardiovascular disease, chronic respiratory conditions, and
immuno-compromised states. Patients with confirmed or suspected COVID-19 infection
are identified based on laboratory testing, diagnostic codes, or clinical documentation.
4. Data Collection
Data for the study are extracted from the institution's EHR database. Relevant variables
extracted include demographic characteristics (age, sex), underlying health conditions,
COVID-19 diagnosis (laboratory-confirmed or clinically suspected), disease severity
(mild, moderate, severe), treatment modalities (pharmacological and
non-pharmacological), healthcare utilization (hospital admissions, emergency department
visits, intensive care unit admissions), and clinical outcomes (mortality, morbidity, length
of hospital stay).
5. Data Analysis
Descriptive statistics are used to summarize the demographic and clinical characteristics
of the study population. Bivariate analysis, including chi-square tests and t-tests, is
conducted to compare outcomes between patients with and without COVID-19 infection
and across different subgroups based on disease severity and comorbidity profiles.
Multivariable regression analysis, such as logistic regression or Cox proportional hazards
models, is employed to assess the association between COVID-19 infection and health
outcomes while controlling for potential confounding variables (e.g., age, sex,
comorbidities).
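As a minimal sketch of the bivariate step, the Pearson chi-square statistic for a
contingency table can be computed as shown below. The toy records and the column names
`covid` and `died` are hypothetical and are not drawn from the study's EHR data, and no
continuity correction is applied.

```python
import pandas as pd

def chi_square_statistic(table: pd.DataFrame) -> float:
    """Pearson chi-square statistic for a contingency table of counts."""
    observed = table.to_numpy(dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    # Expected counts under independence: row total * column total / grand total
    expected = row_totals @ col_totals / observed.sum()
    return float(((observed - expected) ** 2 / expected).sum())

# Hypothetical toy records: 1 = yes, 0 = no
toy = pd.DataFrame({
    "covid": [1] * 40 + [0] * 60,
    "died": [1] * 10 + [0] * 30 + [1] * 5 + [0] * 55,
})
table = pd.crosstab(toy["covid"], toy["died"])
stat = chi_square_statistic(table)
print(round(stat, 3))
```

For a 2x2 table the statistic has one degree of freedom, so a value above roughly 3.84
would indicate an association at the 5% significance level.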
6. Ethical Considerations
The study protocol has been reviewed and approved by the Institutional Review Board
(IRB). Patient confidentiality and privacy are maintained throughout the study, with data
anonymized and stored securely in compliance with relevant regulatory guidelines.
Justification
The COVID-19 pandemic has undoubtedly emerged as one of the most challenging
global crises in recent history, impacting virtually every aspect of human life. This project
undertakes a comprehensive analysis of the pandemic, examining its epidemiological,
socio-economic, healthcare, and future outlook dimensions.
At its core, the COVID-19 pandemic is a health crisis characterized by the rapid spread of
a novel coronavirus, SARS-CoV-2. The virus has demonstrated formidable
transmissibility, resulting in a global pandemic within a matter of months. Understanding
the epidemiology of COVID-19 is crucial for grasping the scale and trajectory of the
crisis. This includes studying transmission dynamics, the emergence of variants, and the
disproportionate impact on vulnerable populations such as the elderly and those with
underlying health conditions.
However, the ramifications of COVID-19 extend far beyond the realm of public health.
The pandemic has triggered a cascade of socio-economic consequences, disrupting
economies, livelihoods, and social structures worldwide. Lockdown measures aimed at
curbing the spread of the virus have led to economic recession, widespread job losses,
and exacerbation of existing inequalities. Moreover, disparities in access to healthcare
services have been magnified, with marginalized communities bearing the brunt of the
pandemic's impact.
complicated efforts to achieve widespread immunity. On the healthcare front, innovations
in treatment, telemedicine, and healthcare delivery have emerged, reshaping the
landscape of medical practice amid a pandemic.
Looking ahead, the future trajectory of COVID-19 remains uncertain, shaped by factors
such as vaccination coverage, the emergence of new variants, and societal responses.
While vaccination offers hope for controlling the spread of the virus and returning to
pre-pandemic normalcy, ongoing vigilance and preparedness are essential to navigate
potential challenges. Long-term consequences of the pandemic, including its impact on
mental health, education, and global governance, will require sustained attention and
resources.
Data Analysis & Visualization
Fig 1 is a heatmap of the correlations between pairs of variables. It supports the
following analysis:
1. High Correlations:
Some pairs of variables show high positive correlations (values close to 1). For example,
age and elderly have a correlation of 1.00, indicating that these two variables are perfectly
correlated. Similarly, ICU and intubated have a high correlation, suggesting that patients
in the ICU are often intubated.
2. Negative Correlations:
Some pairs of variables show high negative correlations (values close to -1). For instance,
sex and pregnancy have a correlation of -1.00, which makes sense biologically as
typically only females can be pregnant.
3. Low or No Correlations:
Many variables show low or near-zero correlations, indicating weak or no linear
relationship between those variables. For instance, sex and age show a correlation of
0.03, suggesting there is no significant linear relationship between these two variables.
4. Distinct Clusters:
Variables such as diabetes, COPD, asthma, hypertension, and other diseases cluster
together, indicating they might be related or occur together frequently in patients.
5. Anomalies:
Some correlations might appear unexpected and could warrant further investigation. For
example, intubation and pneumonia have a correlation of 0.46, which might indicate a
moderate relationship but should be explored further to understand the context better.
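The kind of reading described above can be sketched in code. The example below is a
minimal illustration on hypothetical toy data, not the project's dataset: it computes the
Pearson correlation matrix with pandas and lists the strongly correlated pairs. The helper
name `strong_pairs`, the 0.8 threshold, and the `tobacco_inverted` column are assumptions
made for illustration.

```python
import pandas as pd

def strong_pairs(df: pd.DataFrame, threshold: float = 0.8):
    """Return (col_a, col_b, r) for column pairs with |Pearson r| >= threshold."""
    corr = df.corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if abs(r) >= threshold:
                pairs.append((a, b, round(float(r), 2)))
    return pairs

# Hypothetical toy columns: two identical flags and one flag with inverted coding
toy = pd.DataFrame({
    "icu": [1, 1, 0, 0, 1, 0],
    "intubated": [1, 1, 0, 0, 1, 0],
    "tobacco": [1, 2, 1, 2, 1, 2],
    "tobacco_inverted": [2, 1, 2, 1, 2, 1],
})
print(strong_pairs(toy))
```

A coefficient of exactly 1.00 or -1.00 usually means one column is a direct recoding of
the other, so such pairs are worth checking before they are interpreted clinically.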
b. Histogram to show the age distribution
According to Fig 2: Histogram to show age distribution, we get the following analysis:
c. Heatmap representing the correlation coefficients between pairs
of variables.
d. Count plot generated using the Seaborn library
Fig 4 is a count plot, generated using the Seaborn library, that visualizes the number of
patients (count on the y-axis) across different medical units (medical_unit on the
x-axis), categorized by death status. The hue parameter in the plot differentiates the
counts by death status: '1' likely represents survivors and '2' represents deaths.
● Unit 4 has the highest number of deaths: This is evident from the significant height
of the black bar (death = 2) for medical_unit 4.
● Unit 12 has the highest overall patient count: This is indicated by the tallest bars
(both for survivors and deaths) compared to other units.
● Other units: Units 1, 2, 3, 5, 6, 7, 8, 9, 10, 11, and 13 have comparatively lower
counts of patients and deaths.
The text below the plot incorrectly states "Unit 4 have the highest percentage of death,"
which should be corrected to reflect that Unit 4 has the highest number of deaths, not
necessarily the highest percentage.
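The difference between the highest number of deaths and the highest percentage of deaths
can be checked directly. The sketch below uses hypothetical toy counts and an assumed
coding (`died`: 1 = died, 0 = survived), not the figures behind Fig 4.

```python
import pandas as pd

# Hypothetical toy records: unit 4 has 20 patients, unit 12 has 40, unit 3 has 4
toy = pd.DataFrame({
    "medical_unit": [4] * 20 + [12] * 40 + [3] * 4,
    "died": [1] * 8 + [0] * 12 + [1] * 6 + [0] * 34 + [1] * 3 + [0] * 1,
})

# Deaths and patient totals per unit, then the death percentage
summary = toy.groupby("medical_unit")["died"].agg(deaths="sum", patients="count")
summary["death_pct"] = 100 * summary["deaths"] / summary["patients"]

most_deaths = summary["deaths"].idxmax()     # unit with the largest death count
highest_pct = summary["death_pct"].idxmax()  # unit with the largest death percentage
print(summary)
print(most_deaths, highest_pct)
```

In this toy example, unit 4 records the most deaths while unit 3 has the highest death
percentage, which is why the two statements should not be used interchangeably.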
e. Bar plot created with the Pandas plotting functionality in
Python, visualizing the number of deaths per month
Fig 5 is a bar plot of the number of deaths per month: the y-axis shows the death count
and the x-axis shows the month of death (1 to 12). The plot supports the following
analysis:
● Highest Deaths in June (Month 6): The month of June shows the highest number of
deaths, reaching close to 25,000.
● Significant Deaths in May and July (Months 5 and 7): Both May and July also show
high death counts, though slightly lower than June, with counts between 20,000 and
25,000.
● Moderate Deaths in April and August (Months 4 and 8): April and August have
moderate death counts, ranging between 15,000 and 20,000.
● Low Deaths in Other Months: January, February, March, September, October,
November, and December show significantly lower death counts, all below 5,000.
This plot helps identify the peak months for deaths, which might be useful for
resource planning and understanding seasonal variations in mortality.
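A monthly death count of this kind can be derived from the date-of-death column. The
sketch below is a minimal illustration on hypothetical toy dates. It assumes the
day/month/year format and the "9999-99-99" placeholder that appear in the code listing,
and it uses `errors="coerce"` as one possible way to skip the placeholder.

```python
import pandas as pd

# Hypothetical toy values; "9999-99-99" stands in for rows without a death date
toy = pd.DataFrame({"date_died": ["03/05/2020", "9999-99-99", "21/06/2020",
                                  "02/06/2020", "9999-99-99", "15/07/2020"]})

# errors="coerce" turns unparseable values such as the placeholder into NaT
parsed = pd.to_datetime(toy["date_died"], format="%d/%m/%Y", errors="coerce")

# Count deaths per calendar month, ignoring rows without a valid date
deaths_per_month = parsed.dropna().dt.month.value_counts().sort_index()
print(deaths_per_month)
```

Calling `deaths_per_month.plot(kind="bar")` on the result would give a bar chart in the
style of Fig 5.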
f. Bar chart to show the percentage of death for each factor
According to Fig 6: Barchart to show the percentage of death for each factor, we get
the following analysis:
● Pregnancy has the lowest percentage of death among the factors listed, indicating
that death incidence in pregnant women due to COVID-19 is low.
● Pneumonia has the highest percentage at 39%, followed by the elderly at 32% and
renal chronic conditions at 30%.
● Other significant factors include COPD (27%), diabetes (23%), cardiovascular
diseases (21%), and hypertension (20%).
● Conditions like immunosuppression and obesity also contribute to the death
percentage but are not as high as the aforementioned factors.
● Asthma and tobacco use have relatively lower percentages of death.
g. The histogram illustrates the distribution of ages among
pregnant women in the dataset.
● The chart indicates that the majority of pregnant women in the dataset are between 18
and 35 years old.
● There are few cases of pregnant women above 40, and the frequency drops significantly
for ages beyond 35.
● The dataset contains very few instances of pregnant women aged below 18 and above
35, indicating that pregnancies in these age ranges are relatively rare in the context of
this data.
● The data suggests a concentration of pregnant women within a typical childbearing age
range, aligning with general demographic trends.
PYTHON CODES FOR
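import os
# List the files available under the Kaggle input directory
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
adress="/kaggle/input/covid19-dataset/Covid Data.csv"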
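# Importing
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px
import datetime
from imblearn.over_sampling import RandomOverSampler
from sklearn.tree import DecisionTreeRegressor
# Load the data
df = pd.read_csv(adress)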
Exploration and Cleaning of Data
df.head()
df.isnull().sum().sum()
df.info()
# Parse the date of death, skipping the "9999-99-99" placeholder rows
df['date_parsed'] = pd.to_datetime(df['date_died'][df['date_died'] != "9999-99-99"],
                                   format="%d/%m/%Y")
df['date_died'].value_counts()
df.drop(columns="date_died", inplace=True)
df.nunique()
# Create a column for patients > 65 years and < 65 years
df['elderly'].value_counts()
"clasiffication_final"])]:
if null.any():
f = sns.countplot(x=df[i]);
plt.show()
df[(df['pregnant']==97)].sex.value_counts()
# Assign the result back instead of relying on an in-place replace on a column selection
df['pregnant'] = df['pregnant'].replace(97, 2)
plt.figure(figsize=(16,16))
# Drop columns that show no useful correlation or contain many NaN values
df.shape
sns.histplot(data=df['age']);
#correlation of data
plt.figure(figsize=(16, 6))
sns.heatmap(df.corr(), annot=True);
sns.countplot(hue=df.death, x= df.medical_unit);
df['month']=df["date_parsed"].dt.month
df.groupby("month")["death"].count().plot(kind="bar");
# Function to calculate the percentage of deaths depending on characteristic
# factors of patients
"""
Get the total number of patients with a given characteristic and the number of
those patients who died
"""
total=df[df[col_name]==has].age.count()
return num_died/total*100
t = df[df['death']==1].death.count()/df.shape[0]*100
percen = []
"copd", "diabetes", "pneumonia"]
for i in charc_cols:
p = perc_die(i)
percen.append(p)
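Only part of the helper behind `perc_die` is visible in the listing above. The following
is a minimal sketch of how such a helper could look, not the original function. It
assumes that 1 codes "yes" for a characteristic and that `death == 1` marks a death, as
the `t = ...` line suggests. The explicit `df` argument, the `died` parameter, and the toy
data are hypothetical.

```python
import pandas as pd

def perc_die(df, col_name, has=1, died=1):
    """Percentage of deaths among patients whose `col_name` equals `has`."""
    group = df[df[col_name] == has]
    total = len(group)
    if total == 0:
        # No patient has this characteristic, so the percentage is undefined
        return float("nan")
    num_died = (group["death"] == died).sum()
    return num_died / total * 100

# Hypothetical toy records using the assumed 1 = yes, 2 = no coding
toy = pd.DataFrame({
    "diabetes": [1, 1, 1, 1, 2, 2, 2, 2],
    "pneumonia": [1, 1, 2, 2, 1, 2, 2, 2],
    "death": [1, 2, 1, 2, 1, 2, 2, 2],
})
percen = {col: perc_die(toy, col) for col in ["diabetes", "pneumonia"]}
print(percen)
```

Returning the percentages in a dictionary keyed by column name keeps each value attached
to its factor, which makes a chart such as Fig 6 easier to label.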
"""
Get the total number of patients with a given characteristic and the number of
those patients who have pneumonia
"""
total=df[df[col_name]==has].age.count()
return num_died/total*100
t = df[df['pneumonia']==1].death.count()/df.shape[0]*100
pnm=[]
for i in charc_cols[:-1]:
pr = perc_pnm(i)
pnm.append(pr)
df.columns
df[(df['death']==2)&(df['pregnant']==1)].age.max()
child=df[(df['age']<16)&
(df["death"]==1)].age.count()/df[df['age']<16].age.count()*100
ages=df[df['pregnant']==1].age.unique()
df[df['pregnant']==1].age.hist();
df.columns
features=["inmsupr","hipertension", "elderly","cardiovascular",
"age", "pneumonia"]
y = df.death.iloc[:-10]
x= df.loc[:,features].iloc[:-10]
result_nw_data=df['death'].iloc[-10:]
x.head()
# Sample size
df.shape
# Resampling of data
ros = RandomOverSampler(random_state=0)
x_re,y_re = ros.fit_resample(x,y)
sns.countplot(x=y_re);
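To illustrate what random oversampling does, the sketch below reproduces the idea with
plain pandas instead of `RandomOverSampler`: rows of the smaller class are resampled with
replacement until every class matches the largest one. The function name
`random_oversample` and the toy data are hypothetical, and the sampling details may
differ from the library's implementation.

```python
import pandas as pd

def random_oversample(x, y, random_state=0):
    """Resample smaller classes with replacement until every class matches the largest."""
    target = int(y.value_counts().max())
    picked = []
    for label in y.unique():
        rows = pd.Series(y.index[y == label])
        # Draw the missing number of rows (possibly zero) with replacement
        extra = rows.sample(n=target - len(rows), replace=True,
                            random_state=random_state)
        picked.extend(rows.tolist() + extra.tolist())
    return x.loc[picked].reset_index(drop=True), y.loc[picked].reset_index(drop=True)

# Hypothetical toy data: four rows of class 2 and two rows of class 1
x_toy = pd.DataFrame({"age": [70, 34, 51, 46, 82, 77],
                      "pneumonia": [2, 2, 2, 1, 1, 1]})
y_toy = pd.Series([2, 2, 2, 2, 1, 1], name="death")
x_re, y_re = random_oversample(x_toy, y_toy)
print(y_re.value_counts())
```

After resampling, both classes have four rows, which is the balanced outcome the count
plot of `y_re` is meant to confirm.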
model = DecisionTreeRegressor(random_state=1)
model.fit(train_x, train_y)
prediction = model.predict(val_x)
prediction
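The listing does not show how `train_x`, `val_x`, `train_y`, and `val_y` were created. The
sketch below assumes scikit-learn's `train_test_split` and uses synthetic toy data, so it
illustrates the fit-and-predict step rather than reproducing the project's results. The
toy rule that generates the labels is hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Three hypothetical binary risk factors per patient
x_toy = rng.integers(0, 2, size=(200, 3))
# Toy labelling rule (assumed coding): 1 = died, 2 = survived
y_toy = np.where(x_toy.sum(axis=1) >= 2, 1, 2)

# Assumed split step: by default 75% of rows go to training and 25% to validation
train_x, val_x, train_y, val_y = train_test_split(x_toy, y_toy, random_state=0)

model = DecisionTreeRegressor(random_state=1)
model.fit(train_x, train_y)
prediction = model.predict(val_x)

# A regressor returns floats, so round to the nearest class code before scoring
accuracy = float((np.rint(prediction) == val_y).mean())
print(accuracy)
```

Because the outcome is a class label rather than a quantity, `DecisionTreeClassifier`
from the same module may be a more natural choice. It returns class codes directly and
removes the need for rounding.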
#Make dataframe to compare prediction with real data
real_pred={"Real":list(val_y[10:20]),
"Predict": list(prediction[10:20])}
df_real_pred=pd.DataFrame(real_pred)
df_real_pred
#Evaluate model
list(model.predict(new_data))
result_nw_data
Limitations
Several limitations are acknowledged in this study, including the retrospective study
design, reliance on secondary data sources, potential for selection bias, and limitations
inherent to observational studies, such as unmeasured confounding and residual bias.
● Evolution of the Virus: The ongoing evolution of SARS-CoV-2 and the emergence of
new variants may impact the generalizability and relevance of research findings over
time. Studies conducted early in the pandemic may not capture the dynamics of more
recent viral strains or the effectiveness of vaccines against them.
● Resource Constraints and Ethical Considerations: Resource constraints, including
limited funding, personnel, and infrastructure, may affect the design, conduct, and
interpretation of COVID-19 research. Ethical considerations, such as patient privacy,
informed consent, and equity in access to interventions, may also pose challenges to
study implementation.
● Publication Bias and Preprint Influence: Studies with positive or significant results may
be more likely to be published or disseminated, leading to publication bias and potentially
inflating the perceived effectiveness of interventions or associations. Preprint servers and
rapid dissemination channels may amplify the influence of preliminary findings before
peer review.
Conclusion
The COVID-19 pandemic has emerged as a defining global health crisis of the 21st
century, presenting unprecedented challenges to societies, economies, and healthcare
systems worldwide. Through a comprehensive review of the literature, this paper has
synthesized key findings and insights from research across various domains, including
virology, epidemiology, clinical management, public health interventions, and
socio-economic impacts.